
Retrieval-Augmented Generation (RAG) is an architecture pattern that enhances LLM responses by first retrieving relevant documents from an external knowledge base, then including those documents in the model's context when generating a response. Instead of relying solely on knowledge encoded during training (which can be outdated or incomplete), a RAG system searches a curated document corpus — company documentation, product databases, research papers, or any structured knowledge — and feeds the most relevant passages to the LLM alongside the user's question. This grounds the model's response in actual source material, dramatically reducing hallucinations and enabling answers based on information the model was never trained on.
Why it matters
RAG solves the two biggest problems with raw LLM deployment: hallucination and knowledge staleness. An LLM's training data has a cutoff date, and it has no access to proprietary organizational knowledge. RAG bridges both gaps by giving the model a "research step" before answering: it consults your actual documentation rather than relying on training-time memories. Organizations implementing RAG commonly report large accuracy gains on knowledge-intensive tasks, with accuracy in the 60-70% range for a raw LLM rising to 90-95% once retrieval is added. RAG is also far cheaper than fine-tuning for knowledge injection: updating the document corpus is instant and essentially free, while retraining a model costs thousands of dollars and takes days. For these reasons, RAG has become the default architecture for enterprise AI assistants, customer support bots, and internal knowledge systems.
How it works
A RAG pipeline operates in three stages. First, the indexing stage: documents are split into chunks (paragraphs or sections), each chunk is converted to an embedding vector using an embedding model, and these vectors are stored in a vector database. Second, the retrieval stage: when a user asks a question, their query is also converted to an embedding, and the vector database finds the most similar document chunks using similarity search. Third, the generation stage: the retrieved chunks are inserted into the LLM's prompt as context, and the model generates a response grounded in that specific information. Advanced RAG implementations add reranking (scoring retrieved documents for relevance), hybrid search (combining semantic and keyword search), query transformation (reformulating the user's question for better retrieval), and citation tracking (linking response claims to source documents).
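The three stages above can be sketched end to end in a few dozen lines. This is a minimal illustration, not a production implementation: a toy bag-of-words vector stands in for a real embedding model, an in-memory list stands in for a vector database, and the final step only assembles the grounded prompt rather than calling an LLM.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a term-frequency vector over lowercase words."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stage 1: indexing -- chunk documents and store (vector, chunk) pairs.
# Here each "chunk" is a single sentence for brevity.
chunks = [
    "To configure SSO, open Settings and choose an identity provider.",
    "Billing plans can be changed from the account dashboard.",
    "Okta integration requires a SAML metadata URL from your admin.",
]
index = [(embed(chunk), chunk) for chunk in chunks]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Stage 2: retrieval -- rank indexed chunks by similarity to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

def build_prompt(query: str) -> str:
    """Stage 3: generation -- insert retrieved chunks as grounding context."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How do I configure SSO with Okta?"))
```

In a real system, `embed` would be an embedding-model API call, `index` would live in a vector database, and the prompt would be sent to an LLM; the advanced techniques mentioned above (reranking, hybrid search, query transformation) all slot in around the `retrieve` step.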
Example
A SaaS company deploys an AI support agent for its platform with 2,000 pages of documentation, 500 knowledge base articles, and 50 troubleshooting guides. Without RAG, the LLM answers based on its general training data — it knows about software support patterns but not the specific product. It hallucinates feature names, invents configuration steps, and references outdated workflows. With RAG, each customer question triggers a semantic search across the documentation corpus: "How do I configure SSO with Okta?" retrieves three relevant setup guide sections. The LLM generates its response using those specific sections as context, producing accurate, product-specific instructions with links to the source documentation. Resolution rate improves from 40% to 78%, and the system gracefully handles the 22% it cannot resolve by escalating with the retrieved context attached for the human agent.
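The escalation behavior described above can be sketched as a simple routing decision on retrieval confidence. The threshold value, function names, and ticket structure below are illustrative assumptions, not part of any specific product.

```python
# Sketch of a confidence-gated escalation path: answer when retrieval is
# confident, otherwise hand off to a human with the retrieved chunks attached
# so the system's triage work is not lost.

CONFIDENCE_THRESHOLD = 0.5  # assumed cutoff; tuned per corpus in practice

def handle_question(question: str,
                    retrieved: list[tuple[float, str]]) -> dict:
    """Route to the LLM when the best retrieval score clears the threshold,
    else escalate to a human agent with the retrieved context attached."""
    best_score = max((score for score, _ in retrieved), default=0.0)
    context = [chunk for _, chunk in retrieved]
    if best_score >= CONFIDENCE_THRESHOLD:
        return {"action": "answer", "question": question, "context": context}
    return {"action": "escalate", "question": question, "context": context}

# Low-similarity retrieval results trigger an escalation ticket.
print(handle_question("How do I rotate API keys?",
                      [(0.21, "Keys are managed under the Settings page.")]))
```

The key design point is that the escalation path carries the same retrieved context as the answer path, so a human agent starts from the documents the system already found rather than from scratch.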