OpenAI's fair use claim against The New York Times lawsuit

The AI vendor claims that using materials from the internet to train AI models is widely accepted. It also suggested that the Times manipulated its models into regurgitating old material.

OpenAI is claiming fair use in its response to The New York Times' lawsuit against the ChatGPT creator.

The AI vendor on Jan. 8 answered claims that it used millions of Times articles to train automated chatbots.

OpenAI claims that it has a history of collaborating with other news organizations. It also claims that training AI models on publicly available internet materials is fair use, a position it says is supported by numerous academics, civil groups and startups that submitted comments to the U.S. Copyright Office.

The fair use argument

Only the courts can determine fair use. What OpenAI maintains with its fair use argument is that even if it used The New York Times' copyrighted work in some way, the vendor is not guilty of copyright infringement because what it created is not a newspaper but a chatbot, said Michael Bennett, responsible AI lead at the Institute for Experiential AI at Northeastern University.

The AI vendor also argues that there is an opt-out option for organizations.

"That one might cut both ways," Bennett said.

If The New York Times did opt out, then OpenAI did not honor its contractual obligations, which is bad for the AI vendor. However, if The New York Times did not opt out, then that weakens its argument, Bennett added.

The strength of the Times' claims against OpenAI lies in the newspaper having successfully built a paid subscription model.

Therefore, OpenAI's argument of fair use becomes complicated, said Mark Beccue, a Futurum Research analyst.

"It's digitally protected as a copyright, but not only copyright but paid-for [content]," Beccue said. "The New York Times has a significant reason for protecting what they do because people pay for it."

The argument of regurgitation

When ChatGPT or other AI models spit out the Times' materials, it means the content that was originally behind a paywall is now free for anyone who uses the model.

In its defense, OpenAI argues that while it is working on the problem, its models do not typically behave the way The New York Times suggests.

The AI vendor says the large publishing company probably manipulated the model into regurgitating, or cherry-picked its examples from multiple attempts.

While this could weaken The New York Times' argument, prompt refinement is normal, Bennett said.

"That's how you get to the best answer or better answers from ChatGPT," he said. "Most of us that are doing that are trying to be thoughtful about it. At least we don't go with the first response. We can refine it and keep digging down."

The public domain and protected data

OpenAI's argument of fair use also raises the question of what content is public domain, Beccue said.

As the generative AI market continues to evolve and move toward models smaller than large language models such as GPT-3 and GPT-4, businesses would need to have a strong case for using huge models given questions about their training data, he said. They would need to figure out where their content is coming from and whether it is paid content or free and public.

This shift in the market toward small models and the controversy surrounding large models make traceable training data attractive, Beccue said.

"It makes the model more accurate," he said. "There are multiple reasons for protecting and identifying data."

For organizations like The New York Times, digital watermarking may be the next step for protecting written content, Beccue added.

Digital watermarking is a technique used to embed a code or marker inside digital content to identify it.
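
As a rough illustration of that idea, the sketch below hides a short identifier inside ordinary text using invisible zero-width characters and then reads it back. It is only a toy example under stated assumptions: the function names, the sample text and the marker string are hypothetical, and it does not describe any method used by The New York Times or OpenAI. A marker embedded this way is also easy to strip, so it should not be read as a robust protection scheme.

```python
# Toy text watermark: hide a marker as invisible zero-width characters.
# All names and sample values here are hypothetical, for illustration only.

ZERO = "\u200b"  # zero-width space, used here to encode bit 0
ONE = "\u200c"   # zero-width non-joiner, used here to encode bit 1


def embed_watermark(text: str, marker: str) -> str:
    """Return text with the marker hidden after the first word."""
    # Turn the marker into a string of bits, 8 per UTF-8 byte.
    bits = "".join(f"{byte:08b}" for byte in marker.encode("utf-8"))
    hidden = "".join(ONE if bit == "1" else ZERO for bit in bits)
    # Insert the invisible characters just before the first space, if any.
    head, sep, tail = text.partition(" ")
    return head + hidden + sep + tail


def extract_watermark(text: str) -> str:
    """Recover a marker hidden by embed_watermark, or '' if none is found."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # ignore any incomplete trailing byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")


def strip_watermark(text: str) -> str:
    """Remove the invisible characters, showing how fragile the marker is."""
    return text.replace(ZERO, "").replace(ONE, "")


if __name__ == "__main__":
    original = "Example article text for a subscriber."
    marked = embed_watermark(original, "publisher-123")
    print(extract_watermark(marked))
```

The marked text looks identical to the original on screen, yet the identifier can be recovered from a verbatim copy. A single call to `strip_watermark` removes it, which is why such a marker can help identify copied text but cannot prevent copying.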

The importance of copyright

Even without digital watermarking, the rules of copyright are clear, said Cathy Wolfe, president and CEO of Wolters Kluwer, an information services vendor and member of the Copyright Clearance Center board of directors.

"Without copyright, innovation is at risk," Wolfe said. "If you're going to spend the time, money and effort to create something ... then someone can just take that, put the work in and commercialize that, then that's going to discourage people from actually innovating."

While the lawsuit between the Times and OpenAI might not end up being decided by a judge or jury, one way to avoid problems is for creators to adhere to the collective licensing system in place, Wolfe said.

Collective licensing allows for the licensing of copyrighted materials on behalf of those who own them.

"There's a clear set of allowed uses and a clear price or different licenses for different kinds of uses," she said. "There actually is a well-established process for this to be handled that could be done without anybody going to court."

Esther Ajao is a TechTarget Editorial news writer covering artificial intelligence software and systems.
