Rewriting the code verbatim and distributing it would be copyright infringement, yes; you do not have a right to distribute code written by other people.
That's completely different from reading and learning from code, which is what grondo described.
Clean room design relies on this: in a clean room design, one party reads and describes the protected work, and another party implements it. That first party reading the protected work is learning from closed-source IP.
If an AI infringes on copyright then it infringes on copyright, that's unfortunate for the distributors of that code.
Humans accidentally infringe on copyright sometimes too; it's not a problem unique to machine learning. The potential to infringe on copyright has not made observing, learning from, watching, or reading copyrighted material prohibited for humans, nor should it or (likely) will it become prohibited for machine learning algorithms.
Grondo said that AI should be given access to all code, including private and unlicensed code.
He was given a link to clean room design, which demonstrates the problem with a single entity (the AI) both reading and learning from existing code and then writing new code: the risk of regurgitation.
He goes on to say that's what he does, which doesn't change that fact.
> Humans accidentally infringe on copyright sometimes too.
Indeed we do, and it's almost entirely unnoticed, even by the author.
> nor should it or (likely) will it become prohibited for machine learning algorithms.
If those machine learning algorithms take in unlicensed material and later output unlicensed and/or copyrighted material, then they are a liability. Why would you want that when you can train them otherwise and be sure they NEVER infringe on others' IP? It's a no-brainer, surely. Or are you assuming there is some magic inherent in other people's private code?
> If those machine learning algorithms take in unlicensed material and later output unlicensed and/or copyrighted material, then they are a liability. Why would you want that when you can train them otherwise and be sure they NEVER infringe on others' IP?
Because it could produce a better model that produces better code.
You're now arguing a heavily reduced point. That a model trained on proprietary code is at higher risk of reproducing infringing code is not a point under contention. The clean room serves the same purpose: it is a risk-mitigation strategy.
Risk mitigation is a choice, left up to individuals. Maybe you use a clean room design, maybe you don't. Maybe you use a model trained on closed-source IP, maybe you don't. There are risks associated with these choices, but those choices are for individuals to make.
The choice to observe closed-source IP and learn from it shouldn't be prohibited just because some won't want to assume that risk.