Many people feel it's time AI companies paid for the free data lunches that have made their generative systems big and strong.
Recently, a wave of legal actions demanding compensation from AI companies has been filed in the U.S. and Europe. The plaintiffs include authors and artists, who have consistently expressed concern about AI stealing their work and producing mediocre derivatives.
An open letter from the Authors Guild -- signed by more than 8,500 authors, including Margaret Atwood, Dan Brown and Jodi Picoult -- urges tech companies responsible for generative AI applications, such as ChatGPT and Bard, to cease using their works without proper authorization or compensation. The authors want companies to pay for the data they scraped for training -- the "food" for AI systems, endless meals for which there has been no bill.
Authors also worry that generative AI threatens their profession by flooding the market with machine-written content based on their work. That fear has already materialized: in recent months, Amazon took action against people using AI to spam bestseller lists with generated books.
Prior to the release of the Authors Guild letter, two North American authors -- Mona Awad and Paul Tremblay -- filed a lawsuit against OpenAI, claiming copyright infringement. The suit argued that because ChatGPT generated accurate summaries of the authors' works, OpenAI must have trained its model on those works.
They aren't the only ones. Author and comedian Sarah Silverman is also suing OpenAI and Meta for illegally reproducing her memoir, The Bedwetter, without permission. However, that argument may not hold up in court because of the way generative AI works.
What is generative AI?
Generative AI is the technology that powers ChatGPT and Bard. Text-based generative AI uses algorithms to predict the likely next words in text and generates that text based on a prompt from the user. ChatGPT knows what to generate because it was trained on a large corpus of publicly available data from the internet. It learned patterns from the training and matches those patterns to prompts from the user.
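The next-word prediction described above can be illustrated with a deliberately tiny sketch. Real large language models use neural networks over subword tokens, not word-pair counts, so this is only an assumption-level toy: it learns which word tends to follow which in a small corpus, then extends a prompt one predicted word at a time.

```python
from collections import defaultdict, Counter

# Toy training corpus -- a stand-in for the web-scale text real models use.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which (a bigram model, far simpler than an LLM).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word` in training, or None."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def generate(prompt, length=5):
    """Greedily extend the prompt one predicted word at a time."""
    words = prompt.split()
    for _ in range(length):
        nxt = predict_next(words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)
```

Because "cat" follows "the" most often in the corpus, `predict_next("the")` returns "cat" -- the model reproduces patterns from its training data, which is exactly why the provenance of that data is at issue in the lawsuits.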
Generative AIs are usually black box AI systems, meaning nobody -- not even the programmers -- understands the exact steps the machine takes to go from input to output. Input goes in, the magic happens and output comes out.
All machine learning and generative AI tools use preexisting works of some kind.
Why are people suing?
People are suing AI companies over copyright. ChatGPT was trained on data from the internet, but without permission from the data creators. For example, GPT-3 was trained on Wikipedia and Reddit, among other sources. Conversations about, and excerpts of, copyrighted works can appear in such training material, giving large language models enough context to accurately summarize those copyrighted works.
On a larger scale, people are suing because AI is a black box, and it's impossible to know how it works on a granular level. The fear is that people will use AI to avoid taking responsibility for their own decisions or for what the AI produces.
"If AI companies are allowed to market AI systems that are essentially black boxes, they could become the ultimate ends-justify-the-means devices," Matthew Butterick, one of the lawyers behind several of the lawsuits, wrote in his blog. "Before too long, we will not delegate decisions to AI systems because they perform better. Rather, we will delegate decisions to AI systems because they can get away with everything that we can't."
What AI lawsuits have been filed?
Numerous cases have been brought against generative AI companies regarding copyright and misuse. Here are some of the companies being sued.
GitHub, Microsoft and OpenAI
A class-action suit was filed against these companies involving GitHub's Copilot tool. The tool predictively generates code based on what the programmer has already written. The plaintiffs allege that Copilot copies and republishes code from GitHub without abiding by the requirements of GitHub's open source license, such as failing to provide attribution. The complaint also includes claims related to GitHub's mishandling of personal data and information, as well as claims of fraud. The complaint was filed in November 2022. Microsoft and GitHub have repeatedly tried to get the case dismissed.
Stability AI, Midjourney and DeviantArt
A complaint against these AI image generator providers was filed in January 2023. The plaintiffs alleged the systems directly infringe on plaintiffs' copyrights by training on works created by the plaintiffs and creating unauthorized derivative works. The complaint also takes issue with the fact that the tools can be used to generate work in the style of artists. The judge on the case, William Orrick, said he was inclined to dismiss the lawsuit.
Stability AI
In January 2023, Getty Images filed a complaint in the U.K. against Stability AI for allegedly copying and processing millions of images and associated metadata owned by Getty. Getty filed another lawsuit against Stability AI in the U.S. District Court for the District of Delaware days later, which raised many copyright- and trademark-related claims, and pointed to "bizarre or grotesque" generated images that contained the Getty Images watermark and, therefore, damaged Getty's reputation.
OpenAI
Authors Paul Tremblay and Mona Awad are suing OpenAI for allegedly infringing on authors' copyrights. Butterick is one of the attorneys representing the authors. The complaint estimated that more than 300,000 books were copied in OpenAI's training data. The suit seeks an unspecified amount of money. The case was filed in June 2023.
Meta and OpenAI
Sarah Silverman's lawsuit against Meta and OpenAI alleged copyright infringement, claiming ChatGPT and Large Language Model Meta AI (Llama) were trained on illegally acquired data sets containing her work. The suit alleges the books were acquired from shadow libraries, such as Library Genesis, Z-Library and Bibliotik, where books can be downloaded via torrenting, a peer-to-peer file-sharing method commonly used to distribute files without authorization. Specifically, Meta's language model, Llama, was trained on a data set called the Pile, which includes data from Bibliotik, according to a paper from EleutherAI, the research group that assembled the Pile. The suit was filed in July 2023.
Google
A class-action lawsuit is being brought against Google for alleged misuse of personal information and copyright infringement. Some of the data specified in the lawsuit includes photos from dating websites, Spotify playlists, TikTok videos and books used to train Bard. The lawsuit, filed in July 2023, said Google could owe at least $5 billion. The plaintiffs have elected to remain anonymous.
These copyright cases against big tech companies aren't the first of their kind. In a case decided in 2015, the Authors Guild sued Google for making digital copies of millions of books and providing snippets of them to the public. The court ultimately sided with Google, finding the use transformative and not a market substitute for the books.
What questions do these cases address?
The above lawsuits will be important in answering the following questions:
- Does training a model on copyrighted material require a license? Generative AI systems make copies of the training materials as part of the training process. Does that interim copying require a license, or is it fair use?
- Does generative AI output infringe on copyright for the materials on which the model was trained? If generative output constitutes a derivative work or infringes the training data's reproduction right, then it infringes on copyright. Courts will need to rule whether similarities in output and training data are derived from protected materials or unprotected materials. Who is liable for copyright infringement when AI infringes?
- Does generative AI violate restrictions on removing, altering or falsifying copyright management information? The Digital Millennium Copyright Act restricts the removal or alteration of copyright management information, such as watermarks. This question arises in Getty's case against Stability AI, where Getty alleges that the watermarks Stable Diffusion reproduced on generated works constitute false copyright management information.
- Does generating work in the style of someone violate that person's rights? This is known as the right of publicity, which varies from state to state. It prohibits the use of someone's likeness, name, image, voice or signature for commercial gain.
- How do open source licenses apply to training AI models and distributing the resulting output? The plaintiffs in the Copilot case argued that republishing Copilot training materials without attribution -- and not making Copilot itself open source -- violates open source license terms.
As the cases continue to take shape and answers emerge, companies involved with generative AI tools should watch for guidance around the intersection of AI and intellectual property and check to see if they need risk mitigation strategies.