
How to prepare data for your RAG pipeline
Because accurate data retrieval is critical for RAG systems, development teams must implement effective data preprocessing strategies in their RAG pipeline.
Retrieval-augmented generation connects generative AI to enterprise data. However, without a proper RAG data pipeline, the RAG system might not find the most accurate data to answer user queries.
Data quality is paramount in RAG systems. Therefore, building a successful RAG application includes data preprocessing tasks, such as content filtering, text normalization, chunking, metadata tagging and embedding generation.
To ensure proper data preparation, RAG development teams must understand how RAG makes data searchable, explore data preprocessing strategies such as chunking methods, and learn how to build a RAG data pipeline -- from selecting and cleaning data to embedding and storing it in a vector database.
How RAG makes data searchable
RAG helps generative AI services access external data by storing that data in a vector store or database and retrieving it through a search mechanism.
First, to add unstructured data -- such as a document -- into a vector store, teams need to embed the data through an embedding model.
An embedding converts text into numerical vectors -- a list of decimals that represent text, such as words or sentences, in a way a computer can understand.
For example, consider the following paragraph from a document:
We launched Copilot last week, and adoption has been stellar. Early metrics show a 32% drop in ticket volume and a 27% jump in self-service resolution within the first 48 hours. Finance estimates an annual savings of $1.4 million if the trend holds.
Embedding this paragraph into vectors that represent the text looks like this:
[0.81, -0.23, 0.45, 0.12, -0.67, 0.94, 0.30, -0.55, 0.76, 0.10]
This is a highly simplified example; for instance, if teams used an embedding model such as OpenAI's text-embedding-3-small, it would output a 1536-dimensional vector. Multiple embedding models are available from different providers, and Hugging Face hosts an embedding leaderboard that shows the strengths and weaknesses of different embedding models.
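As a minimal sketch -- assuming the OpenAI Python SDK and an OPENAI_API_KEY environment variable -- generating an embedding for the Copilot paragraph could look like this:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

paragraph = (
    "We launched Copilot last week, and adoption has been stellar. "
    "Early metrics show a 32% drop in ticket volume and a 27% jump in "
    "self-service resolution within the first 48 hours."
)

# text-embedding-3-small returns a 1536-dimensional vector
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=paragraph,
)
vector = response.data[0].embedding
print(len(vector))  # 1536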
These vectors and the original text are stored in a vector store such as Pinecone or Weaviate. The data in the vector store could look something like this:
{
  "id": "unique-id-123",
  "values": [0.81, -0.23, ..., 0.789], // the vector (embedding)
  "metadata": {
    "text": "We launched Copilot last week and adoption has been stellar….",
    "source": "Document A",
    "page": 3
  }
}
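Continuing the sketch above -- assuming the Pinecone Python client and a hypothetical index named docs created with the embedding model's dimensions -- writing that record to the vector store could look like this:

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder key
index = pc.Index("docs")  # hypothetical index name

# "vector" is the embedding produced in the previous snippet
index.upsert(vectors=[{
    "id": "unique-id-123",
    "values": vector,
    "metadata": {
        "text": "We launched Copilot last week, and adoption has been stellar...",
        "source": "Document A",
        "page": 3,
    },
}])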
When a user asks a question related to the data, the prompt goes through an orchestrator -- such as LangGraph, LangChain or Semantic Kernel -- which sends a request to the vector store using a search method. The search returns the most relevant chunks of content and provides them to the large language model (LLM), which generates an answer for the user.
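As an illustrative sketch of that flow -- assuming the same OpenAI and Pinecone setup as above, with the index name, model choice and prompt wording as placeholders -- the retrieval step could look like this:

from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
index = Pinecone(api_key="YOUR_API_KEY").Index("docs")  # hypothetical index name

question = "How has Copilot adoption been going?"

# Embed the question with the same model used for the documents
query_vector = client.embeddings.create(
    model="text-embedding-3-small",
    input=question,
).data[0].embedding

# Retrieve the most relevant chunks from the vector store
results = index.query(vector=query_vector, top_k=3, include_metadata=True)
context = "\n\n".join(match.metadata["text"] for match in results.matches)

# Pass the retrieved context and the question to the LLM
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(completion.choices[0].message.content)

In a production RAG system, the orchestrator typically handles this embed-search-prompt sequence, but the underlying steps are the same.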
The need for data chunking
Chunking is a standard preprocessing method that splits a data set into smaller sections, called chunks. Because embedding models have input limits and retrieval works best on small, focused passages, the original data must be divided into chunks before it is embedded and stored.
For instance, the Copilot paragraph could be chunked into two sections:
Chunk 1:
We launched Copilot last week, and adoption has been stellar.
Chunk 2:
Early metrics show a 32% drop in ticket volume and a 27% jump in self-service resolution within the first 48 hours. Finance estimates an annual savings of $1.4 million if the trend holds.
The problem with this chunking approach is that some chunks can lose their original meaning. For example, if a user asked about Copilot, the search mechanism would likely return only the first chunk, not the second, because the second chunk never mentions Copilot. Without it, the user misses the added context about adoption metrics and estimated savings.
This is where chunking strategies help: they split data so it fits into the vector store while retaining its original meaning in the RAG engine.
4 chunking strategies
Without proper strategies, RAG systems can chunk data incorrectly. Different chunking methods are available, including fixed-size chunking, variable-size chunking based on content, rule-based chunking and sliding window chunking.
1. Fixed-size chunking
This method involves dividing text into equal-sized segments, such as 400 words or 800 characters per chunk, regardless of the content.
2. Variable-size chunking based on content
Here, the text is split according to natural boundaries, such as sentence-ending punctuation, line breaks or structural cues, identified by natural language processing (NLP) tools that analyze document layout and meaning.
3. Rule-based chunking
This approach relies on the document's intrinsic structure or linguistic boundaries. Typical methods include chunking by sentences or paragraphs using predefined rules.
4. Sliding window chunking
This technique produces overlapping chunks by sliding a fixed-size window through the text. For example, each chunk might contain 500 words, with the next chunk starting 300 words into the previous one to create a 200-word overlap between chunks.
For the Copilot paragraph, a sliding window or rule-based approach would ensure that the paragraph does not lose its meaning, since both pieces of content would land in the same chunk.
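As a simple illustration, a word-based sliding window splitter in plain Python -- using the 500-word window and 200-word overlap from the example above; the file name is hypothetical -- could look like this:

def sliding_window_chunks(text, chunk_size=500, overlap=200):
    """Split text into overlapping word-based chunks."""
    words = text.split()
    step = chunk_size - overlap  # each new chunk starts 300 words after the previous one
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

chunks = sliding_window_chunks(open("document_a.txt").read())

The overlap duplicates some words across chunks, which costs a little extra storage but keeps related sentences together in at least one chunk.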
6 RAG data pipeline steps
Another consideration in RAG systems is keeping data up to date. For example, if the original data is a Word document that continues to change after it has been embedded into a vector store, the embedded copy quickly loses value unless it is regularly refreshed.
Also, documents that cannot be directly embedded, such as PDF files, need to be converted into a text-friendly format to push them into the vector store.
Therefore, it is crucial to build a data pipeline that automates the process of continuously updating the vector store with embeddings of new data from different sources and file types.
A RAG data pipeline usually consists of the following six steps:
1. Corpus selection and ingestion
Choose the most relevant data sources and content depending on the use case.
2. Data preprocessing and parsing
Clean and standardize the raw data so it's ready for embedding and retrieval. A commonly used tool is MarkItDown. It supports PDF, PowerPoint and many other file formats and can convert the content to Markdown format, which makes it easier to structure when chunking.
3. Enrichment
Remove noisy content and add helpful metadata, such as placing the document title at the top of the text so it remains available during the chunking process.
4. Filtering
Remove any irrelevant or low-value documents that don't support the use case.
5. Chunking
Split the cleaned data into smaller, logical chunks for better retrieval performance. Frameworks such as LangChain offer several text splitters for this, including CharacterTextSplitter.
6. Embedding
Convert each chunk into a vector and store it in a vector store, as shown in the combined sketch after these steps.
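As a hedged end-to-end sketch of steps 2 through 6 -- assuming MarkItDown, LangChain's text splitters, the OpenAI embeddings client and a Pinecone index as in the earlier snippets; the file name, chunk sizes and index name are illustrative -- the pipeline could look like this:

from markitdown import MarkItDown
from langchain_text_splitters import CharacterTextSplitter
from openai import OpenAI
from pinecone import Pinecone

# Step 2: parse the source file into Markdown-friendly text
text = MarkItDown().convert("quarterly_report.pdf").text_content

# Step 3 (simplified enrichment): keep the document title at the top of the text
text = "# Quarterly report\n\n" + text

# Step 5: split the cleaned text into chunks
splitter = CharacterTextSplitter(chunk_size=800, chunk_overlap=200)
chunks = splitter.split_text(text)

# Step 6: embed each chunk and store it in the vector store
client = OpenAI()
index = Pinecone(api_key="YOUR_API_KEY").Index("docs")
for i, chunk in enumerate(chunks):
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunk,
    ).data[0].embedding
    index.upsert(vectors=[{
        "id": f"quarterly-report-{i}",
        "values": embedding,
        "metadata": {"text": chunk, "source": "quarterly_report.pdf"},
    }])

Running a sketch like this on a schedule keeps the vector store in sync with the source documents.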
Teams can automate much of this process with cloud services and frameworks. For instance, Azure AI Search, Microsoft's cloud-based vector store and search service, provides a scheduling feature that automatically pulls data from a storage account, embeds it and loads it into the search index on a predefined schedule.
Marius Sandbu is a cloud evangelist for Sopra Steria in Norway who mainly focuses on end-user computing and cloud-native technology.