Retrieval-augmented generation (RAG) is a pattern for building language model systems where the model generates responses using both its learned parameters and external knowledge retrieved at inference time, typically from a search index or vector database.
What is Retrieval-Augmented Generation (RAG)?
RAG combines two subsystems: retrieval and generation. In the retrieval step, the system converts the user query into a search representation, often an embedding, then retrieves the most relevant documents, passages, or records from a knowledge store. In the generation step, the system provides the retrieved context to a generative model as grounding material, so the model can cite, summarize, or reason over the retrieved content when writing the answer.

RAG reduces reliance on parametric memory, so it can help with freshness and domain specificity, and it can reduce hallucinations when retrieval returns accurate sources. Implementation details matter: teams tune chunking, embeddings, re-ranking, and context window allocation. They also design prompts that instruct the model to prefer retrieved evidence and to abstain when evidence is missing.
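The retrieve-then-generate loop above can be sketched end to end. The sketch below is a minimal, self-contained illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, and the grounding prompt template is hypothetical, not a recommended production format.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words term counts. A real system would call
    # an embedding model here; this is only a stand-in for the sketch.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    # Retrieval step: rank chunks by similarity to the query, keep the top-k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, contexts):
    # Generation step input: ground the model in the retrieved evidence and
    # instruct it to abstain when the evidence is missing.
    evidence = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer using only the sources below; if they are insufficient, say so.\n"
        f"Sources:\n{evidence}\n\nQuestion: {query}\nAnswer:"
    )

chunks = [
    "RAG retrieves documents from a knowledge store at inference time.",
    "Fine-tuning changes model weights using labeled examples.",
    "Vector databases index embeddings for similarity search.",
]
top = retrieve("How does RAG fetch documents?", chunks)
prompt = build_prompt("How does RAG fetch documents?", top)
```

In a real deployment the toy pieces are swapped out: `embed` becomes an embedding model, the in-memory list becomes a vector or hybrid index, and `prompt` is sent to a generative model.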
Where it is used and why it matters
RAG is common in enterprise chatbots, internal knowledge assistants, customer support, legal and compliance search, and developer documentation Q&A. It matters because it enables controlled knowledge access, updates without retraining, and traceability via citations. It also introduces new failure modes, such as poor retrieval recall, irrelevant context injection, and security risks if the knowledge base contains sensitive data or untrusted content. Observability and evaluation are needed to detect errors in both the retrieval and generation stages.
Types
- Naive RAG: single retrieval step, top-k context appended to the prompt.
- Reranked RAG: uses a cross-encoder or LLM to re-rank retrieved candidates before generation.
- Multi-hop RAG: performs iterative retrieval where earlier results trigger follow-up queries.
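The reranked variant can be sketched as a two-stage pipeline: a cheap first-stage retriever over-fetches candidates, then a more expensive scorer reorders them. In this sketch `overlap_score` is only a stand-in; a real system would score each (query, candidate) pair with a cross-encoder model or an LLM call.

```python
def rerank(query, candidates, score_fn, k=3):
    # Second stage: score every (query, candidate) pair and keep the top-k.
    # In reranked RAG, score_fn would be a cross-encoder or LLM judge.
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:k]

def overlap_score(query, text):
    # Stand-in scorer: count of shared tokens. Illustrative only; a real
    # cross-encoder reads the full pair and outputs a relevance score.
    return len(set(query.lower().split()) & set(text.lower().split()))

candidates = [
    "Embeddings map text to vectors.",
    "RAG grounds answers in retrieved text.",
    "Re-ranking reorders retrieved candidates.",
]
best = rerank("reorders retrieved candidates", candidates, overlap_score, k=1)
```

The design point is the split: the first stage optimizes recall over the whole corpus cheaply, and the second stage optimizes precision over a small candidate set with a model too slow to run corpus-wide.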
FAQs
- Is RAG a replacement for fine-tuning?
Not always. RAG helps with up-to-date knowledge and citations, while fine-tuning helps with style, policy, and task behavior. Many systems use both.
- What database do you need for RAG?
You can use a vector database, a hybrid keyword-plus-vector index, or even a traditional search engine, as long as retrieval quality is good.
- How do you evaluate a RAG system?
Measure retrieval recall and precision, answer faithfulness to sources, citation accuracy, and end-to-end task success on a representative test set.
- What are common RAG failure modes?
Missing relevant chunks, retrieving near-duplicates, overloading the context window, and the model ignoring retrieved evidence.
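The retrieval-quality metrics mentioned in the evaluation FAQ are simple to compute once you have gold relevance labels. A minimal sketch, assuming documents are identified by string IDs and `relevant` holds the gold set for one query:

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of the gold relevant documents found in the top-k results.
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k results that are actually relevant.
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k

retrieved = ["d3", "d1", "d7", "d2"]  # ranked output for one query
relevant = ["d1", "d2"]               # gold labels for that query

r = recall_at_k(retrieved, relevant, 3)     # 0.5: only d1 of {d1, d2} is in the top-3
p = precision_at_k(retrieved, relevant, 3)  # 1/3: one of the three results is relevant
```

Averaging these over a representative query set gives the retrieval half of the evaluation; faithfulness and citation accuracy still require judging the generated answers against the retrieved sources.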