LLM is a programming language (Score: 1)
The prompts provided to the LLM should be copyrightable as code, and the generated code should be protected the same way compiled or intermediate code is.
The issue at hand is how the model was trained. And the users of the LLM should be made clearly aware by the model trainer whether the user or the model trainer bears the liability for using other people's code to train the model.
That said, we should soon be seeing models that are trained using training courses rather than massive amounts of code. Once that happens, the models will make use of agents to search the web and learn from Stack Exchange or other sources how to solve problems the same way a human would. Of course, when a human learns how to do something from searching the web, simply copy/pasting other people's code can be an issue and we have to read license restrictions. But learning how someone did something and doing it ourselves is generally safe. If a model reads an article while searching and then learns how to do something, it should also be protected as long as it doesn't copy verbatim.
We have a lot of legalities to deal with.
1) Massive models are going to die. I don't know why companies like OpenAI and Anthropic are wasting their time on that nonsense. They won't even be in business by the time the lawsuits come through.
2) Agentic models will be the focus of the future because they work more like humans. We give them the base information needed to learn and find the answers themselves. Using cloud-based solutions where the hosting companies keep massive amounts of data locally cached so the model can research faster could be an issue. But these will cost money and really won't do more than local AI will; they'll just be faster. For legality's sake you'd want to avoid the cloud models, since caching can be seen as theft. But local AI is much different. With agentic solutions, I think most legal issues come back to the same copyright issues we've always had. We just have to make sure the models we use follow the copyright rules. If it's allowed, copy/paste. If it's not, then learn how it's done and make your own solution. In a perfect world, we'd then have a Stack Exchange or alternate GitHub for AI-generated code, with posts that different LLMs could use to learn from each other. The problem then becomes whether that would be seen as training a large model and whether the models would be in violation of copyright again.
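The "if it's allowed, copy/paste; if not, reimplement" rule above could be sketched as a simple policy check an agent runs before reusing a snippet it found online. This is a minimal illustration only: the license groupings are rough generalizations (real license compliance is far more nuanced), and the function name is hypothetical, not part of any real agent framework.

```python
# Rough sketch of the copy-vs-reimplement decision described above.
# The license sets are illustrative generalizations, not legal advice.

# Licenses that broadly permit direct reuse (typically with attribution).
PERMISSIVE = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0"}
# Copyleft licenses, where verbatim reuse pulls your code under their terms.
COPYLEFT = {"GPL-2.0", "GPL-3.0", "AGPL-3.0"}

def reuse_policy(license_id: str) -> str:
    """Decide how an agent should treat found code: copy it, or just
    learn the approach and write its own implementation."""
    if license_id in PERMISSIVE:
        return "copy-with-attribution"
    # Copyleft or unknown license: safest to learn from it and reimplement.
    return "reimplement"

print(reuse_policy("MIT"))      # copy-with-attribution
print(reuse_policy("GPL-3.0"))  # reimplement
```

An unknown license falls through to "reimplement" on purpose: absent a clear grant, learning the technique and writing your own code is the safer default the comment describes.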