Examining the future of AI and open source software

As AI coding tools gain traction in the enterprise, it remains unclear whether AI-generated code violates open source software licenses -- but legal claims indicate possible risk.

Many of today's generative AI tools and systems train on vast amounts of internet data. Because some of that data comes from open source code repositories, issues surrounding open source licenses -- including potential violations -- could increasingly come into play regarding AI training data and output.

Using generative AI effectively in the enterprise requires grappling with a range of challenges, such as ensuring adequate data quality when training models on custom data, properly fine-tuning models and mitigating generative AI security risks. Another challenge, seemingly easy to overlook but potentially profound, is open source software licensing.

Although it remains unclear under which circumstances generative AI technology might violate open source software licenses, it's plausible that courts will find it does, leading to legal and compliance risks for businesses that rely on certain generative AI models. Organizations should know how open source licenses affect AI, the status of lawsuits involving this issue and how stakeholders -- including both open source developers and businesses using generative AI -- can adapt to protect their interests.

LLMs and open source software licensing

Most of the major large language models available today, such as those developed by OpenAI, Meta and Google, are trained on vast troves of data collected from the internet. This includes open source software code available on websites like GitHub.

Training on open source code enables models to generate new code, a key feature of prominent AI products like GitHub Copilot, ChatGPT and Claude. However, the extent to which that code is actually "new" is up for debate; one could argue that the source code generated by AI tools isn't brand new, but rather a regurgitation of the code the model was trained on.

Most open source software is available to use and modify free of cost, so no one is arguing that LLM developers should have paid open source projects for the right to scan those projects' code. However, open source code is governed by licenses that impose requirements other than payment.

The terms vary across the dozens of open source licenses in existence today. A common requirement of copyleft licenses, such as the GPL, is that developers who modify open source code and distribute the modified software must also release the modified source code to the public. Some licenses also require practices like crediting the original authors of open source code when releasing software that incorporates or reuses that code.

Figure: Flow chart for choosing an open source license, guiding users through permissive (Apache 2.0, MIT) and copyleft (LGPL v3, GPL v3) options. Different open source licenses have different requirements around issues such as patents and commercial use.

The impact of open source software licenses on AI

Historically, open source licenses were designed with the assumption that reusing or modifying code involved developers inserting the code into applications they were otherwise coding from scratch. Complying with open source licenses under those conditions is straightforward.

But when using an LLM to help develop software, the definition of open source reuse is murkier. Say a developer uses an LLM to generate code for an application, and the model's ability to generate that code hinges on its training with open source software. Does that mean the developer is reusing that open source software because they used an LLM to help write their application's code?

Some argue that the answer is yes, and consequently, applications developed with assistance from AI must comply with the software licenses governing the code that the AI trained on. Others, however, contend that open source licensing requirements shouldn't apply because LLMs aren't reusing open source code in the conventional sense. Although LLMs were trained on open source code, the argument goes, they don't create verbatim copies of that code -- except in the rare cases where they reportedly do.

This issue is relevant whether developers use an open source LLM or a proprietary LLM built on closed source algorithms. That's because the debate focuses not on the code within LLMs themselves, but rather on the data they were trained on.

The uncertainty of open source license violation

Determining whether LLMs violate open source software licenses is currently difficult for several reasons:

  • Lack of legal precedents. To date, no court has ruled on whether training an LLM using open source code qualifies as open source reuse or modification.
  • Ambiguity of reuse. There are many ways to interact with LLMs that train on open source code, each with different implications for whether developers must comply with licenses governing the original code. Developers could ask LLMs to write "new" source code and incorporate it directly into an app without making any modifications. They could also use LLM-generated code as guidance, but ultimately write their own code. Alternatively, they could do something in between, modifying some parts of LLM-generated code while reusing other parts verbatim.
  • Lack of training data transparency. Most pretrained LLMs, including many open source ones, don't disclose their training data. Without that information, it's impossible to know which open source code a model trained on and thus which licenses apply.

Because of these uncertainties, it's impossible to say at present whether using generative AI to help write software constitutes an open source license violation. In addition, the specific ways that developers use AI are likely to matter. For example, courts might determine that it's permissible to use AI if developers modify AI-generated code to a certain extent. Similarly, developers might be required to prove that the model they're using was trained only on open source code not governed by strict licenses. This is currently difficult because there is typically little transparency about which training data a model used.

Lawsuits involving open source licenses and AI

So far, one major lawsuit has emerged based on allegations that generative AI services violate open source licenses.

The suit, colloquially known as GitHub Copilot Intellectual Property Litigation, was filed in late 2022 by the Joseph Saveri Law Firm and Matthew Butterick. It alleges that Microsoft and OpenAI profited "from the work of open-source programmers by violating the conditions of their open-source licenses." There is not yet any public indication of how the court might rule in this case or the importance of the precedent the ruling might set.

The New York Times has also filed a lawsuit over AI use that, while not directly related to open source software licenses, could set a relevant precedent. The Times alleged that Microsoft and OpenAI violated its copyrights by training AI models on newspaper content owned by the Times without permission.

If a court were to agree, the decision might suggest that output produced by AI models trained on certain data is essentially a form of reuse of that data. Therefore, copyright or licensing terms that apply to the original data would also apply to data generated by the AI model. This would likely support claims that generative AI violates open source software licenses.

The Times case, like the GitHub case, is pending, with no indication so far of which direction the court might lean.

Navigating uncharted terrain

The lack of legal precedent makes it difficult for any stakeholders -- open source communities, AI developers or software developers who use AI tools -- to know how to act. Nonetheless, each group can take certain steps to help protect its interests.

Open source communities can create licenses that state in clear terms how developers of generative AI models can and can't use their code in model training. The open source space has a long history of adapting its licensing strategies to keep pace with technological change, and AI is just the newest chapter in that story.

AI developers can help by being more transparent about when and how they use open source code for training. They would also be wise to prepare to retrain their models without open source code in case courts find that they violated open source licenses. It would be bad news for vendors like Microsoft and OpenAI if GitHub Copilot and similar tools suddenly became unusable due to license infringement.

As for developers who use generative AI coding tools, a practice worth considering is tracking which code in a codebase was produced by AI tools. If it becomes necessary in the future to remove that AI-generated code to avoid open source licensing violations, the ability to identify that code will be crucial.
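One way such tracking might work is to wrap AI-produced code in marker comments and scan for them later. The Python sketch below illustrates the idea under stated assumptions: the marker strings `AI-GENERATED-BEGIN` and `AI-GENERATED-END`, the `AiRegion` type and the `find_ai_regions` function are all hypothetical names invented for this example, not a convention recognized by GitHub Copilot or any other tool.

```python
from dataclasses import dataclass

# Hypothetical marker convention chosen for this sketch; no AI coding
# tool is known to emit or recognize these strings.
BEGIN_MARKER = "AI-GENERATED-BEGIN"
END_MARKER = "AI-GENERATED-END"


@dataclass
class AiRegion:
    start_line: int  # 1-indexed line holding the BEGIN marker
    end_line: int    # 1-indexed line holding the END marker
    note: str        # free text after the BEGIN marker, e.g. a tool name


def find_ai_regions(source: str) -> list[AiRegion]:
    """Return the marked AI-generated regions found in a file's text."""
    regions = []
    start = None
    note = ""
    for lineno, line in enumerate(source.splitlines(), start=1):
        if BEGIN_MARKER in line:
            if start is not None:
                raise ValueError(f"nested BEGIN marker at line {lineno}")
            start = lineno
            # Keep whatever follows the marker as a free-text note.
            note = line.split(BEGIN_MARKER, 1)[1].strip()
        elif END_MARKER in line:
            if start is None:
                raise ValueError(f"END marker without BEGIN at line {lineno}")
            regions.append(AiRegion(start, lineno, note))
            start = None
    if start is not None:
        raise ValueError(f"BEGIN marker at line {start} is never closed")
    return regions
```

Running `find_ai_regions` over each file in a repository would yield a list of line ranges that a team could review or remove if needed. A scheme like this only works if developers apply the markers consistently at the time the code is written; it cannot detect unmarked AI-generated code after the fact.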

It might also be smart for developers to avoid becoming too reliant on AI-assisted coding tools until the legality of such tools becomes clearer. Developers looking for a fast way to generate code without writing it from scratch might consider sticking with traditional low-code/no-code tools, which produce code without using LLMs and are therefore not facing claims of licensing violations.

Chris Tozzi is a freelance writer, research adviser, and professor of IT and society who has previously worked as a journalist and Linux systems administrator.
