Retrieval-augmented generation (RAG) connects generative AI to enterprise data. However, without a proper RAG data pipeline, the RAG system might not find the most accurate data to answer user queries.

Data quality is paramount in RAG systems. Therefore, building a successful RAG application includes data preprocessing tasks, such as content filtering, text normalization, chunking, metadata tagging and embedding generation.

To ensure proper data preparation, RAG development teams must understand how RAG makes data searchable, explore data preprocessing strategies such as chunking methods, and learn how to build a RAG data pipeline -- from selecting and cleaning data to embedding and storing it in a vector database.

How RAG makes data searchable

RAG helps generative AI services access external data by using a vector store or database with a search mechanism. First, to add unstructured data -- such as a document -- into a vector store, teams need to embed the data through an embedding model. An embedding model converts text into a numerical vector -- a list of decimal numbers that represents text, such as words or sentences, in a way a computer can compare.

For example, consider the following paragraph from a document:

We launched Copilot last week, and adoption has been stellar. Early metrics show a 32% drop in ticket volume and a 27% jump in self-service resolution within the first 48 hours. Finance estimates an annual savings of $1.4 million if the trend holds.

Embedding this paragraph into a vector that represents the text looks like this:

    [0.81, -0.23, 0.45, 0.12, -0.67, 0.94, 0.30, -0.55, 0.76, 0.10]

This is a highly simplified example. For instance, if teams used an embedding model such as OpenAI's text-embedding-3-small, it would output a 1,536-dimensional vector. Multiple embedding models are available from different providers. Hugging Face has an embedding leaderboard that shows the strengths and weaknesses of different embedding models.

These vectors and the original text are stored in a vector store such as Pinecone or Weaviate. The data in the vector store could look something like this:

    {
      "id": "unique-id-123",
      "values": [0.81, -0.23, ..., 0.789], // the vector (embedding)
      "metadata": {
        "text": "We launched Copilot last week and adoption has been stellar….",
        "source": "Document A",
        "page": 3
      }
    }

When a user asks a question related to the data, the application sends the prompt through an orchestrator -- such as LangGraph, LangChain or Semantic Kernel -- which then queries the vector store using a search method. The search method could return two or more chunks of relevant content and provide this information to the large language model (LLM) that generates an answer for the user.
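The store-and-search flow described above can be sketched in a few lines of Python. This is a minimal illustration, not how Pinecone, Weaviate or any orchestrator works internally. The functions `build_vocab`, `toy_embed` and `search` are hypothetical, the second document is invented for contrast, and the word-count "embedding" is only a stand-in for a real embedding model, which would capture meaning rather than exact word matches.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(texts):
    """Assign each distinct token a fixed position in the vector."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding model: one word count per vocab slot."""
    vector = [0.0] * len(vocab)
    for token in tokenize(text):
        if token in vocab:  # words unseen at indexing time are ignored
            vector[vocab[token]] += 1.0
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(store, query_vector, top_k=2):
    """Return the top_k (score, record) pairs, most similar first."""
    scored = [(cosine_similarity(query_vector, r["values"]), r) for r in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


documents = [
    {"id": "doc-a", "source": "Document A", "page": 3,
     "text": "We launched Copilot last week, and adoption has been stellar. "
             "Early metrics show a 32% drop in ticket volume and a 27% jump in "
             "self-service resolution within the first 48 hours. Finance estimates "
             "an annual savings of $1.4 million if the trend holds."},
    # Invented second document, used only to give the search something to reject
    {"id": "doc-b", "source": "Document B", "page": 1,
     "text": "The cafeteria menu changes every Monday."},
]

vocab = build_vocab(d["text"] for d in documents)

# Records mirror the id / values / metadata layout shown earlier
store = [
    {"id": d["id"],
     "values": toy_embed(d["text"], vocab),
     "metadata": {"text": d["text"], "source": d["source"], "page": d["page"]}}
    for d in documents
]

results = search(store, toy_embed("How is Copilot adoption going?", vocab), top_k=1)
for score, record in results:
    print(round(score, 3), record["metadata"]["source"])
```

In a production pipeline, the embedding call and the similarity search would be handled by the embedding provider and the vector database, and the orchestrator would pass the returned text to the LLM as context.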

The need for data chunking

Chunking is a standard preprocessing method that turns a data set into smaller sections, called chunks. Because embedding models accept only a limited amount of text per input, and a single vector for a long document blurs its details, the source data must be split into chunks before it is stored. For instance, the Copilot paragraph could be chunked into two sections:

Chunk 1: We launched Copilot last week, and adoption has been stellar.

Chunk 2: Early metrics show a 32% drop in ticket volume and a 27% jump in self-service resolution within the first 48 hours. Finance estimates an annual savings of $1.4 million if the trend holds.

The problem with this chunking approach is that some chunks might lose their original meaning. For example, if a user asked the RAG system about Copilot, the search mechanism would likely return only the first chunk, since the second chunk never mentions Copilot and nothing links the two. Without the second chunk, the user doesn't get the added context and information about Copilot adoption. This is where chunking strategies help: they split data into pieces sized for the vector store while retaining the original meaning in a RAG engine.
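The retrieval gap can be reproduced with a short Python sketch. It uses plain keyword overlap as a crude stand-in for vector search, so it overstates the effect compared with a real embedding model. The `matching_chunks` helper, the stopword list and the sample query are hypothetical.

```python
import re

chunks = {
    "chunk-1": "We launched Copilot last week, and adoption has been stellar.",
    "chunk-2": "Early metrics show a 32% drop in ticket volume and a 27% jump in "
               "self-service resolution within the first 48 hours. Finance estimates "
               "an annual savings of $1.4 million if the trend holds.",
}

# Small illustrative stopword list so filler words do not count as matches
STOPWORDS = {"what", "has", "the", "a", "an", "is", "in", "of", "and"}


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matching_chunks(query, chunks):
    """Return the ids of chunks that share at least one meaningful word with the query."""
    query_terms = tokens(query) - STOPWORDS
    return [chunk_id for chunk_id, text in chunks.items() if query_terms & tokens(text)]


hits = matching_chunks("What results has Copilot delivered?", chunks)
print(hits)  # ['chunk-1']
```

The second chunk holds the actual results, yet it is never retrieved because it no longer carries the word that ties it to the question.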

4 chunking strategies

Without proper strategies, RAG systems can chunk data incorrectly. Different chunking methods are available, including fixed-size chunking, variable-size chunking based on content, rule-based chunking and sliding window chunking.

1. Fixed-size chunking
This method involves dividing text into equal-sized segments, such as 400 words or 800 characters per chunk, regardless of the content.

2. Variable-size chunking based on content
Here, the text is split according to natural boundaries, such as sentence-ending punctuation, line breaks or structural cues, identified by natural language processing (NLP) tools that analyze document layout and meaning.

3. Rule-based chunking
This approach relies on the document's intrinsic structure or linguistic boundaries. Typical methods include chunking by sentences or paragraphs using predefined rules.

4. Sliding window chunking
This technique produces overlapping chunks by sliding a fixed-size window through the text in steps smaller than the window. For example, each chunk might contain 500 words, with the next chunk starting 300 words in to create a 200-word overlap between chunks.

For the Copilot paragraph, a sliding window or rule-based approach would keep the paragraph from losing its meaning, since both pieces of content would be available in the same chunk.
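Three of these strategies are simple enough to sketch in Python. The function names and the blank-line paragraph rule are assumptions made for illustration. Content-based chunking is left out because it depends on an NLP library, and frameworks such as LangChain ship their own text splitters with different options.

```python
import re


def fixed_size_chunks(words, size):
    """Fixed-size chunking: consecutive, non-overlapping groups of `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def sliding_window_chunks(words, size, step):
    """Sliding window chunking: windows of `size` words that start every `step` words."""
    if not 0 < step <= size:
        raise ValueError("step must be between 1 and size")
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):  # this window already reached the end
            break
    return chunks


def paragraph_chunks(text):
    """Rule-based chunking: one chunk per paragraph, split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


# 1,000 placeholder words stand in for a longer document
words = [f"w{i}" for i in range(1000)]

fixed = fixed_size_chunks(words, size=400)
print([len(chunk) for chunk in fixed])  # [400, 400, 200]

sliding = sliding_window_chunks(words, size=500, step=300)
overlap = set(sliding[0]) & set(sliding[1])
print(len(sliding), len(overlap))  # 3 200

text = (
    "We launched Copilot last week, and adoption has been stellar. "
    "Early metrics show a 32% drop in ticket volume and a 27% jump in "
    "self-service resolution within the first 48 hours. Finance estimates "
    "an annual savings of $1.4 million if the trend holds.\n\n"
    "A second, unrelated paragraph would become its own chunk."
)
paragraphs = paragraph_chunks(text)
print("Copilot" in paragraphs[0] and "$1.4 million" in paragraphs[0])  # True
```

With the paragraph rule, the product name and its metrics stay in one chunk, so a question about Copilot retrieves both. The window size and step in the sliding window version trade storage and duplicate results against the chance of cutting related sentences apart.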