Model collapse explained: How synthetic training data breaks AI

Without human-generated training data, AI systems malfunction. This could be a problem if the internet becomes flooded with AI-generated content.

By Ben Lutkevich, Site Editor

Published: 07 Jul 2023

Garbage in, garbage out. Data pollution is ruining generative AI's future.

A recent study by researchers in Canada and the U.K. explained the phenomenon of model collapse. Model collapse occurs when new generative models train on AI-generated content and gradually degenerate as a result.

In this scenario, models start to forget the true underlying data distribution, even if the distribution does not change. This means that the models begin to lose information about the less common -- but still important -- aspects of the data. As generations of AI models progress, models start producing increasingly similar and less diverse outputs.

Generative AI models need to train on human-produced data to function. When trained on model-generated content, new models exhibit irreversible defects. Their outputs become increasingly "wrong" and homogeneous. Researchers found that even in the best learning conditions, model collapse was inevitable.

Why is model collapse important?

Model collapse is important because generative AI is poised to bring about significant change in digital content. More and more online communications are being partially or completely generated using AI tools, which could create data pollution on a large scale. Although creating large quantities of text is more efficient than ever, model collapse implies that much of this data will be of little value for training the next generation of AI models.

More AI-generated data pollution makes data from human interactions with systems harder to find and more valuable. Companies and platforms with access to human-generated data will be more likely to create high-quality AI models. Companies that were able to scrape the web before AI pollution will have an advantage over those scraping the post-ChatGPT web for quality training data.

How does model collapse occur?

Model collapse happens when new AI models are trained on generated or synthetic data from older models. The new models become too dependent on patterns in the generated data. The underlying cause is that generative models replicate patterns they have already seen, and only so much information can be extracted from those patterns.

In model collapse, probable events are overestimated and improbable events are underestimated. Through repeated generations, probable events poison the data set, and tails shrink. Tails are the improbable but important parts of the data set that help maintain model accuracy and output variance. Over generations, models compound errors and more drastically misinterpret data.
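The shrinking of tails can be illustrated with a toy simulation. The following is a minimal sketch, not the researchers' method: it assumes a made-up five-token distribution, and each "generation" simply re-estimates token frequencies from a finite sample drawn from the previous generation's estimate. The names `resample_generation`, `simulate` and `support_size` are hypothetical helpers introduced for this example.

```python
import random
from collections import Counter


def resample_generation(probs, n_samples, rng):
    """Draw n_samples tokens from probs, then re-estimate probs from that sample."""
    # Only tokens that still have nonzero probability can ever be drawn.
    support = [token for token, p in probs.items() if p > 0]
    weights = [probs[token] for token in support]
    counts = Counter(rng.choices(support, weights=weights, k=n_samples))
    # Tokens that were not drawn get probability 0 in the next generation.
    return {token: counts[token] / n_samples for token in probs}


def simulate(initial_probs, n_samples=100, generations=50, seed=0):
    """Return the estimated distribution at every generation, starting with the original."""
    rng = random.Random(seed)
    history = [dict(initial_probs)]
    for _ in range(generations):
        history.append(resample_generation(history[-1], n_samples, rng))
    return history


def support_size(probs):
    """Count how many tokens still have nonzero probability."""
    return sum(1 for p in probs.values() if p > 0)


if __name__ == "__main__":
    # Illustrative distribution (an assumption): a few common tokens and a thin tail.
    initial = {"common": 0.60, "frequent": 0.25, "uncommon": 0.10,
               "rare": 0.04, "very_rare": 0.01}
    for generation, probs in enumerate(simulate(initial)):
        if generation % 10 == 0:
            print(generation, support_size(probs), probs)
```

Once a rare token fails to appear in a sample, its estimated probability drops to zero and it can never return, so the number of surviving tokens can only stay the same or shrink from one generation to the next. Real generative models are far more complex, but the same finite-sampling effect is one way tails get lost.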

Researchers define two types of model collapse: early and late. In early model collapse, the model begins to lose information about probability tails. In late model collapse, the model blends together what should be distinct patterns in the data. Eventually, the outputs become increasingly similar to each other with little resemblance to the original data.

The previously mentioned study, "The Curse of Recursion: Training on Generated Data Makes Models Forget," tested three AI model types by repeatedly feeding them model-generated data. In all three cases, the researchers found instances of model collapse:

Gaussian mixture model (GMM). A GMM models data as a mixture of several Gaussian distributions, which lets it separate data into clusters. Within 50 generations of retraining, the fitted data distribution had changed completely. By generation 2,000, almost no variance remained in the data.
Variational autoencoder (VAE). The VAE was trained on real data and used to generate images of handwritten digits. The next generations were trained on model-generated data. As the generations progressed, the images got progressively blurrier until each digit resembled a roughly uniform smudge.
Large language model (LLM). The LLM -- OPT-125m -- was fine-tuned using only model-generated data in one scenario and a mixture of human-generated and model-generated data in another. Researchers found that although model performance degraded over generations, some level of learning was still possible with generated data. Still, an example in the study showed OPT-125m's responses to a prompt about medieval architecture across generations. By the last generation shown, generation 9, the model was outputting completely unrelated text about jackrabbits.
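The loss of variance seen in the GMM experiment can be reproduced in a much simpler setting. The sketch below is a simplified single-Gaussian analogue, not the study's GMM setup: each generation fits a mean and standard deviation to a small sample, then generates the next sample from that fit. The function name `recursive_gaussian_fit` and the sample size and generation count are assumptions chosen for illustration.

```python
import random
import statistics


def recursive_gaussian_fit(n_samples=10, generations=500, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the standard deviation at each generation, starting with the
    true value of 1.0 for the original distribution.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # the "real" data distribution
    stds = [std]
    for _ in range(generations):
        # Generate training data from the current model only.
        sample = [rng.gauss(mean, std) for _ in range(n_samples)]
        # "Train" the next model: estimate its parameters from that data.
        mean = statistics.mean(sample)
        std = statistics.stdev(sample)
        stds.append(std)
    return stds


if __name__ == "__main__":
    stds = recursive_gaussian_fit()
    for generation in (0, 100, 200, 300, 400, 500):
        print(f"generation {generation}: std = {stds[generation]:.3g}")
```

Each fit carries a small estimation error, and because every generation sees only the previous generation's output, those errors compound instead of averaging out. With small samples the estimated spread tends to drift toward zero, which mirrors the narrowing the researchers reported.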

Future of model collapse

The effects of model collapse -- long-term poisoning of language model data sets -- have been occurring since before the mainstreaming of technology such as ChatGPT. Content farms have been used for years to intentionally influence search algorithms and social networks, prompting those platforms to change how they value content. For example, Google devalues content that appears to be farmed or low-value, and it focuses more on rewarding content from trustworthy sources such as education domains.

Researchers argue that there will be an increased need for ways to distinguish between artificially generated data and data that comes from humans. There is currently no reliable way to track LLM-generated content at scale.

In the immediate future, it is likely that companies with a stake in creating the next generation of machine learning models will rush to acquire human data wherever possible. They will do this in anticipation of a future where it will be even harder to distinguish human-generated data from synthetic data.

The Internet Archive recently experienced an outage due to extremely fast, high-volume requests for its public domain optical character recognition files. The Internet Archive called the requests "abusive traffic" in a tweet and said the traffic came from an AWS customer, speculating that it was an AI company harvesting data.

Model collapse vs. modal collapse

The term model collapse is inspired by literature on modal collapse -- more commonly called mode collapse -- in generative adversarial networks (GANs). The terms sound similar, but have some key differences:

Modal collapse is specific to GANs, which are a type of machine learning model. It occurs when the generator in a GAN begins producing a very limited variety of samples regardless of the input. It is called modal collapse because the generator fails to capture the multiple modes in a diverse data distribution.
Model collapse is a broader term for the degradation of generative models that train on generated data. It applies to many types of machine learning models and generative AI systems, including LLMs, VAEs and GMMs. The researchers argue that it can affect any generative model that is repeatedly trained on synthetic data.

How to prevent model collapse in LLMs

Although there is no agreed-upon way to track LLM-generated content at scale, one proposed option is community-wide coordination among organizations involved in LLM creation to share information and determine the origins of data.

In the meantime, to avoid being affected by model collapse, companies should try to preserve access to pre-2023 bulk stores of data.
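In practice, preserving older data means being able to tell when a document was collected. The following is a minimal sketch, assuming each record carries a crawl date. The `Document` class, its `crawled_on` field, the `split_by_cutoff` helper and the exact cutoff date are all hypothetical, and a crawl date is only a rough proxy for whether a human wrote the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    text: str
    crawled_on: date  # hypothetical provenance field


# Assumed cutoff: treat anything crawled before 2023 as the older store.
CUTOFF = date(2023, 1, 1)


def split_by_cutoff(docs, cutoff=CUTOFF):
    """Separate documents crawled before the cutoff from those crawled on or after it."""
    older = [doc for doc in docs if doc.crawled_on < cutoff]
    newer = [doc for doc in docs if doc.crawled_on >= cutoff]
    return older, newer


if __name__ == "__main__":
    corpus = [
        Document("Forum post about cathedral restoration", date(2019, 5, 2)),
        Document("Product description of unknown origin", date(2023, 8, 14)),
    ]
    older, newer = split_by_cutoff(corpus)
    print(len(older), "older documents,", len(newer), "newer documents")
```

Keeping the two sets separate, instead of discarding the newer one, leaves room to apply better provenance checks to recent data if such checks become available.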
