Retrieval-Augmented Language Model pre-training

Alexander S. Gillis, Technical Writer and Editor

Published: Jan 30, 2024

What is Retrieval-Augmented Language Model pre-training?

A Retrieval-Augmented Language Model, also referred to as REALM or RALM, is an artificial intelligence (AI) language model designed to retrieve text and then use it to perform question-based tasks.

Pre-training refers to the process of first training a model on one task or data set before training it on another, related task or data set. Starting from a model that has already been trained on an adjacent task is a fast and efficient way to build AI applications because it gives the model a head start compared with training a new model from scratch. Language model pre-training also helps capture a large amount of world knowledge, which can be crucial for neural network-based natural language processing (NLP) tasks such as question answering.

Google introduced retrieval-augmented language model pre-training in 2020 in a paper about using masked language models, such as BERT, to perform open-domain question answering. The approach pairs a language model architecture with a corpus of documents, meaning the collection of data used to train the AI. This helps a REALM model find documents, identify their most relevant passages and return the relevant data for information extraction.

Basic REALM architecture

Retrieval-augmented models typically use a semantic retrieval mechanism. REALM, for example, uses a knowledge retriever and a knowledge-augmented encoder. The knowledge retriever helps the large language model (LLM) find and focus on specific text from a large knowledge corpus. An LLM is a type of AI algorithm that uses deep learning techniques and massively large data sets to understand, summarize, generate and predict new content. When the user inputs a prompt, the knowledge retriever identifies relevant documents. The knowledge-augmented encoder then extracts the correct data from the retrieved text. That text and the original prompt are passed to the LLM to answer the user's initial question.
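The retriever-then-encoder flow can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the `embed` function is a hypothetical bag-of-words stand-in for a learned embedding, the three-document `CORPUS` is invented for illustration, and the `[SEP]` joiner simply mimics how a BERT-style encoder input is often written. None of these names come from REALM itself.

```python
import math
import re
from collections import Counter

# Tiny illustrative corpus (invented for this sketch).
CORPUS = [
    "The pyramidion sits on top of a pyramid.",
    "BERT is a masked language model released by Google.",
    "Paris is the capital of France.",
]

def embed(text):
    """Hypothetical stand-in for a learned embedding: word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def relevance(query, doc):
    """Inner product of the two sparse count vectors."""
    q, d = embed(query), embed(doc)
    return sum(q[word] * d[word] for word in q)

def retrieve(query, corpus, k=1):
    """Return the top-k (document, probability) pairs.

    Probabilities come from a softmax over the relevance scores.
    """
    scores = [relevance(query, doc) for doc in corpus]
    normalizer = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / normalizer for s in scores]
    ranked = sorted(zip(corpus, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

def build_encoder_input(query, doc):
    """Join the prompt and the retrieved text for the downstream encoder."""
    return f"{query} [SEP] {doc}"

query = "What is the capital of France?"
top_doc, prob = retrieve(query, CORPUS, k=1)[0]
encoder_input = build_encoder_input(query, top_doc)
print(encoder_input)
```

A production retriever would use learned dense embeddings and an approximate nearest-neighbor index rather than word counts, but the shape of the pipeline (score, rank, join with the prompt) is the same.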

A diagram showing how a retrieval-augmented language pre-training model works. In REALM pre-training, a knowledge retriever finds specific text from a large corpus and then uses a knowledge-augmented encoder to retrieve the correct data from the text. The text and the original prompt are then passed to the LLM to answer the user's initial question.

Stages in a pre-training program

Pre-training requires a machine learning model and two different data sets. The basic stages include the following:

Train the machine learning model with its initial training data set. Initial training stages typically consist of an assessment stage to determine if training is required; a development stage, where the training material, environment and various tools are developed or chosen; a delivery stage where training begins; and an evaluation benchmark stage where the effectiveness of the training is determined. A diverse initial training data set exposes the model to various features, patterns and representations of data.
Define the model parameters and how it uses the initial training data set. As an example, in REALM, pre-training and fine-tuning tasks are formalized as a retrieve-then-predict generative process.
Begin training the model on the new data set. It's important that the new data set is similar in form to the model's initial training data. For example, a model that's already trained to predict traffic metrics wouldn't be a useful starting point for object detection. But a model trained on object detection would be useful for creating a model that can identify animals.
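The retrieve-then-predict process mentioned in the second stage can be written out as a formula. The notation below follows the original REALM paper rather than this article: x is the input, y is the output to predict, z is a document retrieved from the corpus Z, and f(x, z) is a relevance score computed as the inner product of two learned embeddings.

```latex
% The model marginalizes over retrieved documents z:
p(y \mid x) = \sum_{z \in \mathcal{Z}} p(y \mid z, x)\, p(z \mid x)

% The retriever is a softmax over relevance scores:
p(z \mid x) = \frac{\exp f(x, z)}{\sum_{z'} \exp f(x, z')},
\qquad
f(x, z) = \mathrm{Embed}_{\text{input}}(x)^{\top}\, \mathrm{Embed}_{\text{doc}}(z)
```

In words, the retriever first assigns each document a probability of being relevant to the input, and the encoder's prediction is then averaged over documents, weighted by those probabilities.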

Pre-training is typically applied for transfer learning, classification or feature extraction.

Transfer learning applies the knowledge gained by one machine learning model to another model.
Classification refers to training a machine learning model to assign inputs to categories, such as classifying images.
Feature extraction identifies and extracts relevant data features from a data set, where the extracted features are then used in another model.
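As a rough illustration of feature extraction, the sketch below keeps a "pre-trained" component frozen and trains only a small downstream model on its output. Everything here is hypothetical: `frozen_extractor` stands in for a pre-trained model, the points and labels are made up, and the downstream model is a simple threshold rather than a neural network.

```python
def frozen_extractor(point):
    """Hypothetical pre-trained feature: squared distance from the origin.

    It is never updated while the downstream model is trained.
    """
    x, y = point
    return x * x + y * y

def fit_threshold(features, labels):
    """Train the downstream model: midpoint between the two class means."""
    positives = [f for f, label in zip(features, labels) if label == 1]
    negatives = [f for f, label in zip(features, labels) if label == 0]
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    return (mean_pos + mean_neg) / 2

# Made-up training data for the new task: label 1 means "far from origin".
train_points = [(0.1, 0.2), (0.3, -0.1), (2.0, 1.5), (-1.8, 2.2)]
train_labels = [0, 0, 1, 1]

# Extract features with the frozen model, then fit only the new model.
features = [frozen_extractor(p) for p in train_points]
threshold = fit_threshold(features, train_labels)

def predict(point):
    """Classify a new point using the extracted feature and learned threshold."""
    return 1 if frozen_extractor(point) > threshold else 0

print(predict((0.0, 0.5)), predict((3.0, 0.0)))
```

Because only the threshold is learned, four labeled examples are enough here, which mirrors the idea that reused features reduce the amount of new training data needed.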

Pros and cons of pre-training

Benefits of pre-training include the following:

Ease of use. Developers don't need to create models from scratch. They can instead find a pre-trained model that was trained on a similar task and train it again to the specific task being worked on.
Optimizes performance. A pre-trained model can reach optimized performance faster because it starts from parameters that are already likely to produce good results.
Doesn't require large amounts of training data. Pre-trained models don't require as much training data as building a model from scratch. Additionally, models available online are likely to already have been trained on extremely large data sets.
Improves NLP tasks. REALM pre-training improves the efficiency of NLP-related tasks, such as those for question answering.

Potential downsides to pre-training, however, might include the following:

Requires fine-tuning. The fine-tuning process can be resource-intensive and require time for effective tuning.
Can produce ineffective results. Using an already trained model for a task that's too different from its initial task won't produce effective results in training.

Retrieval-augmented generation, retrieval-augmented language model and LLMs

Retrieval-augmented language models, LLMs and retrieval-augmented generation (RAG) are all closely related. REALM and RAG are both AI models and frameworks that work with LLMs.

But where REALM is a language model designed to retrieve text from a corpus of initial training data and then use it to answer knowledge-intensive question-based tasks, RAG is designed to access external information, separate from its initial training data. For example, RAG can retrieve data from external sources such as external knowledge bases, databases or the internet.

LLMs typically have a training end date, after which the model is unaware of any new events or developments. This essentially freezes an LLM's knowledge at a point in time, so LLMs typically aren't working with the most up-to-date information. RAG gets around this limitation by pulling from external sources of information in real time. This improves the quality of responses while reducing AI hallucinations. An AI model such as ChatGPT that uses RAG isn't limited to what it knew at its training end date.
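The way RAG sidesteps the training end date can be sketched in a few lines of Python. This is a minimal illustration, not a real RAG framework: the `knowledge_base` list, the word-overlap scoring and the prompt template are all assumptions made for the example, and the final call to an LLM is left out.

```python
import re

def tokens(text):
    """Lowercase word and number tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, store, k=1):
    """Rank passages by how many tokens they share with the question."""
    q = tokens(question)
    ranked = sorted(store, key=lambda passage: len(q & tokens(passage)), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Combine retrieved passages and the question into one prompt string."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )

# External store (invented content); it lives outside the model.
knowledge_base = [
    "Release 1.0 of the product shipped with basic search.",
    "The office cafeteria opens at 8 a.m.",
]

# New information arrives after the model was trained. Only the store
# changes; the model's weights are untouched.
knowledge_base.append("Release 2.0 of the product added offline mode.")

question = "What did release 2.0 add?"
passages = retrieve(question, knowledge_base, k=1)
prompt = build_prompt(question, passages)
print(prompt)
```

The resulting prompt would then be sent to the LLM, which can answer from the supplied context even though the fact postdates its training data.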

REALM can also be paired with zero-shot learning, which is a machine learning concept that recognizes samples from classes that the model wasn't initially trained on.

Pre-training vs. fine-tuning

While pre-training is the concept of training a previously trained machine learning model on a similar task with new training data, fine-tuning refers to the process of refining a pre-trained model to work on particular tasks. Fine-tuning uses a smaller data set with the goal of adjusting and specializing the model to fit a specific task. An example of this is fine-tuning an LLM for sentiment analysis.

Pre-training and fine-tuning aren't mutually exclusive, however. For example, a REALM model can be pre-trained and then later fine-tuned. Fine-tuning lets the model take advantage of its broad knowledge from pre-training while also specializing in a specific target task, which typically improves performance on that task.
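The benefit of fine-tuning a pre-trained model rather than starting from scratch can be shown with a deliberately tiny example. The sketch below assumes a one-variable linear model trained by gradient descent; the data sets, step counts and learning rate are invented for illustration and have nothing to do with how REALM itself is trained.

```python
def train(w, b, data, steps, lr=0.05):
    """Plain gradient descent on mean squared error for y = w * x + b."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

def loss(w, b, data):
    """Mean squared error of the model on a data set."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

# Larger "general" data set for pre-training: y = 2x.
general_data = [(i / 5, 2 * (i / 5)) for i in range(-5, 6)]

# Small task-specific data set for fine-tuning: y = 2x + 0.5.
task_data = [(-1.0, -1.5), (0.0, 0.5), (1.0, 2.5)]

# Pre-train, then fine-tune for only a few steps on the task data.
pretrained = train(0.0, 0.0, general_data, steps=200)
fine_tuned = train(*pretrained, task_data, steps=5)

# Baseline: the same few steps on the task data, starting from scratch.
from_scratch = train(0.0, 0.0, task_data, steps=5)

fine_tuned_loss = loss(*fine_tuned, task_data)
from_scratch_loss = loss(*from_scratch, task_data)
print(fine_tuned_loss < from_scratch_loss)
```

With the same small training budget on the task data, the model that started from pre-trained parameters ends with the lower loss, because pre-training already moved its slope close to the right value.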
