
AI cannibalism explained: A model failure

AI cannibalism – training on AI-generated content – creates a feedback loop that worsens bias, degrades quality and risks model collapse, making systems unreliable or unusable.

AI systems require data for training. It's increasingly common today, particularly with generative AI (Gen AI) models, to train them on content created by existing models, whether on the full set of AI-generated output or on a distilled version of it.

Regardless, training AI on AI is called AI cannibalism, or digital cannibalism, and it is closely related to model collapse, a degenerative condition producing increasingly homogeneous output.

When AI trains on AI, the new model has access only to knowledge derived from the data that trained the original model. To continue the cannibalism idea, offspring models "feed" on the original model by consuming its outputs, and the errors in those outputs become systemic errors in the offspring. Any bias from the original model propagates across all cannibal models trained on it. The result is a recursive feedback loop: Each generation of AI models becomes further disconnected from both the original source data and any new knowledge.

Over time, the cannibal models degrade in knowledge and quality, producing increasingly inaccurate insights and outputs. Ultimately, the models are simply no longer practically useful and collapse.
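To make that feedback loop concrete, consider the toy Python sketch below. It is only an illustrative stand-in, not how production LLMs are trained: the "model" is nothing more than the topic frequencies observed in its training data, and each generation trains solely on the previous generation's output. Rare topics the model fails to reproduce get probability zero and never return, so the diversity of the data shrinks generation after generation.

    import random
    from collections import Counter

    # Generation 0: "human" data drawn from a vocabulary of 50 topics,
    # with a Zipf-like long tail: a few common topics and many rare ones.
    vocab = [f"topic_{i}" for i in range(50)]
    weights = [1.0 / (i + 1) for i in range(50)]
    data = random.choices(vocab, weights=weights, k=500)

    for generation in range(10):
        # "Train" a toy model: record topic frequencies from the current data set.
        counts = Counter(data)
        print(f"gen {generation}: {len(counts)} distinct topics survive")

        # "Generate" the next training set purely from the model's own output.
        # A topic the model never produced can never come back.
        topics, freqs = zip(*counts.items())
        data = random.choices(topics, weights=freqs, k=500)

Run repeatedly, the count of surviving topics only ever falls, a small-scale version of the homogenization that defines model collapse.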

The importance of data sets in AI

The first stage in constructing modern large language models (LLMs) is the training stage. Here, a model learns, or is trained on, data. This foundational data is the basis on which a model makes decisions and derives insights.

Any AI system's knowledge is based on the original data that trained the system. That knowledge can be augmented, however. For example, techniques such as fine-tuning, a post-training process that provides a model with additional data, and retrieval-augmented generation (RAG), in which a model has access to data from a vector database, commonly supplement the original training data.
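As a rough sketch of how RAG supplements a model's built-in knowledge, the toy retrieval step below pulls the most relevant snippet from a small in-memory document list and folds it into the prompt. The documents and the keyword-overlap scoring are invented for illustration; real systems use embedding models and a vector database.

    # Toy retrieval step for RAG. The documents and keyword-overlap scoring are
    # stand-ins for illustration; real systems use embeddings and a vector database.
    documents = [
        "Fine-tuning adapts a pre-trained model with additional, task-specific data.",
        "Model collapse describes degradation when models train on AI-generated output.",
        "Data lineage records where each training record originated.",
    ]

    def retrieve(query: str, docs: list[str]) -> str:
        """Return the document sharing the most words with the query."""
        query_words = set(query.lower().split())
        return max(docs, key=lambda d: len(query_words & set(d.lower().split())))

    def build_prompt(query: str) -> str:
        """Supplement the model's built-in knowledge with retrieved context."""
        context = retrieve(query, documents)
        return f"Context: {context}\n\nQuestion: {query}\nAnswer using the context above."

    print(build_prompt("what causes model collapse"))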

With the benefit of those data sets, the model generates output based on its weights, which help it determine relevance and importance. The model uses all available data to make that decision: primarily the initial training data, plus any fine-tuning or RAG data.

When the model's data is missing or incomplete in some way, it still generates output, but that output is not accurate. Instead, it's AI hallucination. Additionally, AI outputs depend on data sets with diversity of opinion and information. Diversity in data sets limits the risk of AI bias.

When AI trains on other AI, the new AI model inherits any deficiencies in data set diversity or accuracy. Also, multiple models trained on the original AI model all produce the same outputs, further limiting diversity.

The original model's data is considered synthetic data. It is not the original source data set, but rather an AI output representing the original data. Functionally, the data in a cannibalized model is a copy of a copy, carrying the risk of degraded accuracy, lack of information diversity and knowledge gaps. Without the benefit of new, human-generated and real data, eventually AI models lose the ability to understand new input, generate varied responses, create content or innovate in any way.

Risks involved with AI cannibalization

AI cannibalization introduces a series of related risks to users of AI systems, including:

  • Model collapse. In this overarching risk, models train on increasingly artificial data across generations. They begin to lose information about rare, but important, data patterns and produce increasingly similar, and functionally useless, outputs.
  • Model performance degradation. With reduced capacity for unique outputs, models lose the ability to manage edge cases and maintain the statistical diversity necessary for robust performance. This degradation, too, manifests as predictable outputs, reduced creativity and an increased likelihood of hallucination or contextually inappropriate responses.
  • Information pollution. AI cannibalism creates a risk that, in time, more AI-generated content exists than man-made content, polluting the internet with derivative content that lacks originality or insight.
  • Erosion of trust. As a model's quality degrades, its accuracy declines. This inaccuracy erodes trust in AI models.
  • Lack of innovation. When models use the same patterns and data, innovation is stifled.
  • Bias boosting. When AI models are trained on synthetic data from earlier AI systems with existing bias, the cannibalized model amplifies that bias, as the sketch after this list illustrates.
  • Diversity deficiency. Lacking human-generated data sets, AI models struggle to absorb the readily available diversity of opinion and knowledge.
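One way to picture how bias boosting compounds is the toy loop below. The 55/45 starting split and the 20% ambiguity rate are invented for illustration: each generation learns the label distribution its predecessor produced, and whenever a case is ambiguous it falls back to the majority label, so a mild skew in the original data grows to roughly 85/15 after five generations.

    # Toy bias-boosting loop. Each generation learns the label distribution the
    # previous model produced; ambiguous cases default to the majority label,
    # so the skew compounds. The starting split and ambiguity rate are invented.
    AMBIGUOUS_FRACTION = 0.2

    positive_rate = 0.55  # generation 0: real data, mildly skewed toward "positive"
    for generation in range(6):
        print(f"gen {generation}: {positive_rate:.0%} of labels are 'positive'")
        majority = 1.0 if positive_rate > 0.5 else 0.0
        # Clear-cut cases reproduce the learned rate; ambiguous ones get the majority label.
        positive_rate = (1 - AMBIGUOUS_FRACTION) * positive_rate + AMBIGUOUS_FRACTION * majority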

Answers to AI cannibalization

Untreated AI cannibalization leads to significant performance issues. Not surprisingly, reliable solutions involve improving data quality across several related areas, including:

  • Data curation. Fundamentally, AI cannibalism results from using poor, AI-generated or synthetic data sets. Data curation is the process of organizing, cleaning and maintaining data sets. Properly maintained data sets ensure the right data is used to train models.
  • Data lineage tracking. One critical aspect of curation is tracking data's origin, a process known as data lineage, which is a valuable tool in shrinking the risk of AI cannibalism.
  • Data governance policies. Data curation and data lineage, while helpful, become impactful only when part of a proper data governance practice. These policies and procedures, no longer optional for today's data initiatives, improve an organization's AI training and data usage overall.
  • AI content detection technologies. When possible, deploy tools that reliably distinguish between human and AI-generated content. Identifying potentially AI-generated content is crucial in maintaining data quality and limiting AI cannibalism; the sketch after this list shows how detection and lineage metadata might combine.
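Pulling curation, lineage and detection together, one hedged sketch of how such filtering might work in code follows. The Record fields, source names and detector flag are hypothetical, not a standard schema: each training record carries provenance metadata, and anything without trusted human lineage, or flagged as AI-generated, is dropped before training.

    from dataclasses import dataclass

    @dataclass
    class Record:
        text: str
        source: str        # lineage metadata: where the record came from (hypothetical field)
        ai_flagged: bool   # set by an upstream AI-content detector (hypothetical field)

    def curate(records: list[Record]) -> list[Record]:
        """Keep only records with trusted human provenance that no detector flagged."""
        trusted_sources = {"licensed_corpus", "human_submission"}
        return [r for r in records if r.source in trusted_sources and not r.ai_flagged]

    corpus = [
        Record("Quarterly report written by an analyst.", "human_submission", False),
        Record("Blog post produced by a text generator.", "scraped_web", True),
        Record("Scraped page of unknown origin.", "scraped_web", False),
    ]
    print([r.text for r in curate(corpus)])  # only the analyst-written record survives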

Safeguarding AI's future

AI cannibalism threatens the long-term usefulness of AI itself, so vendors, model makers, industry associations and policymakers must take steps to safeguard AI for the future, emphasizing:

  • Human content creation. At its core, AI cannibalism risks losing the benefit of any new information. Focusing on human content generation and prioritizing the collection of authentic human-generated content reduces this risk.
  • Quality-over-quantity data strategies. Many AI efforts focus on more data to improve the model. Shifting focus from massive data sets to smaller sets of higher-quality, verified, human-generated data maintains model performance while reducing susceptibility to cannibalization.
  • Collaborative industry standards. Establishing industry standards for data labeling, content verification and model training practices helps safeguard model quality across the industry.
  • Regulatory initiatives. Policy and regulatory initiatives improve data compliance best practices, another way to reduce AI cannibalism risk. Early efforts in the space include the European Union's AI Act, which defines requirements and transparency standards that indirectly address data quality concerns. In the U.S., nascent efforts include the COPIED Act of 2024, which seeks to establish content authentication standards.

Sean Michael Kerner is an IT consultant, technology enthusiast and tinkerer. He has pulled Token Ring, configured NetWare and been known to compile his own Linux kernel. He consults with industry and media organizations on technology issues.
