Retrieval-Augmented Generation (RAG)· AI architecture pattern
Retrieval-Augmented Generation (RAG) — Definition, How It Works, and Examples (2026)
Quick answer~1 min
What RAG is
A plain LLM has two problems for production chatbot use:
- Its knowledge is frozen at training time. GPT-4 doesn't know what changed in your product yesterday.
- It hallucinates. When asked something it doesn't know, it confidently invents plausible-sounding answers.
RAG solves both. The flow:
- User asks a question.
- The system retrieves relevant documents from a knowledge base — typically using vector search to find semantically similar passages.
- The retrieved passages are inserted into the LLM's prompt as context.
- The LLM generates a response grounded in those passages, often with inline citations to the source documents.
The user gets an answer that reflects the current state of your knowledge base, not just what the LLM happened to memorize during training.
The technical architecture
A production RAG system has five components:
flowchart TB
subgraph Ingest[Offline ingestion - runs when docs change]
D1[Source docs<br/>PDFs · URLs · Notion<br/>Help Center · wikis] --> D2[Chunk into passages<br/>200-1000 tokens each]
D2 --> D3[Embed each chunk<br/>via embedding model]
D3 --> D4[(Vector DB<br/>Pinecone · Weaviate<br/>pgvector · Chroma)]
end
subgraph Query[Online query - runs every user turn]
U[User question] --> Q1[Embed question]
Q1 --> Q2[Search vector DB<br/>top-K nearest chunks]
Q2 --> Q3[Optional: rerank<br/>cross-encoder]
Q3 --> Q4[Inject chunks + question<br/>into LLM prompt]
Q4 --> Q5[LLM generates<br/>grounded answer with citations]
Q5 --> R[Reply to user]
end
D4 -.-> Q2
Figure 1. RAG architecture splits offline (ingestion) and online (query) paths. The vector database is the shared state between them. The LLM's hallucination risk drops sharply when the prompt includes retrieved passages and instructs the model to ground its answer in them.
1. Document ingestion pipeline
The source content (product docs, FAQ articles, internal wikis, Notion pages, Google Docs, PDF manuals) is loaded, cleaned, and chunked into smaller passages — typically 200-1,000 tokens per chunk. Chunking strategy matters: too small, and context is lost; too large, and retrieval becomes imprecise.
2. Embedding model
Each chunk is converted to an embedding — a high-dimensional vector (typically 512-3,072 dimensions) that captures its semantic meaning. Embeddings are stored in a vector database — Pinecone, Weaviate, Qdrant, pgvector, Chroma, and increasingly built into LLM platforms themselves.
3. Retrieval
At query time, the user's question is also embedded, and the vector database finds the K nearest chunks (typically K = 3-10) by cosine similarity. More advanced systems use hybrid search (combining vector similarity with keyword matching through BM25) for better recall.
4. Reranking (optional)
The top K retrieved chunks may be reranked using a cross-encoder model for relevance. This adds latency but improves precision in domains with many similar-looking documents.
5. Generation
The retrieved chunks + the user's question + a system prompt are passed to the LLM. The LLM is instructed to answer using only the provided context and to cite sources. The output is returned to the user, ideally with links to the source documents.
When RAG matters
RAG is essential for:
- Customer support chatbots. When users ask "how do I configure feature X?", the chatbot needs to answer from your current docs, not from what GPT-4 happened to read about your product (probably nothing useful).
- Internal knowledge agents. Employee questions about HR policies, expense rules, code conventions — all need grounding in your specific documents.
- Compliance-sensitive domains. When wrong information has consequences (medical, legal, financial within an organization), RAG provides traceability — every claim links to a source the user can verify.
- Frequently-updated domains. If your product changes weekly, retraining or fine-tuning the LLM is impractical; updating the knowledge base is cheap.
RAG matters less for:
- General-knowledge tasks. Asking "who won the 1996 World Series" doesn't need RAG — the LLM's training already covers this.
- Creative or open-ended tasks. Writing a marketing email or brainstorming ideas doesn't need document retrieval.
- Math, code, and reasoning — these benefit more from tool use (calling a calculator, running code) than from document retrieval.
RAG in chatbot platforms
Most modern chatbot platforms built for customer support include RAG as a first-class feature, branded as "knowledge base", "AI training", or "chatbot knowledge":
- Chatbase — pure RAG platform. Upload docs (PDFs, URLs, raw text), and the bot answers from them.
- Botpress — RAG built into Knowledge Base nodes inside flows. Open-source, runs your own embedding model if desired.
- Voiceflow — similar pattern; upload documents, set retrieval params, and invoke in flows.
- Manychat — AI Replies feature uses RAG against your training docs to answer DMs accurately. Lower control over chunking / retrieval params than dedicated platforms.
- Intercom Fin — RAG on top of your Intercom Help Center content automatically.
The level of operator control varies. SMB platforms (Manychat, SendPulse, Intercom Fin) expose "upload docs, set tone, go." AI-agent-category platforms (Botpress, Voiceflow) let you tune chunking strategy, embedding model, retrieval K, reranker on/off, and system-prompt grounding instructions.
RAG vs fine-tuning
These are sometimes positioned as alternatives, but they solve different problems:
- RAG — adds knowledge to a generic LLM at query time. Cheap, fast to update, citation-friendly.
- Fine-tuning — adapts the LLM's weights through additional training. Expensive, slow to update, but can shift tone, style, and domain reasoning patterns.
For most chatbot applications, RAG is the right answer. Fine-tuning becomes worth considering when:
- You need consistent tone / style at scale (a brand voice can be fine-tuned in)
- Your domain has specialized vocabulary the base model doesn't handle well
- You have enough labeled examples (typically thousands) to make fine-tuning worthwhile
Many production systems use both: a fine-tuned base for tone and domain familiarity, plus RAG for current factual content.
RAG limitations
- Retrieval misses. If the relevant document isn't retrieved, the LLM falls back to making things up (or admitting it doesn't know, depending on prompt). Tuning retrieval is critical.
- Context window pressure. Retrieved chunks compete with conversation history for the LLM's context budget. Long conversations + many retrieved chunks → truncation.
- Stale knowledge base. RAG is only as good as the indexed content. Outdated docs produce outdated answers, just with more confidence.
- Latency. Retrieval adds 100-500 ms to query time; reranking adds more. Acceptable for chat but expensive for high-throughput applications.
- Cost. Each query embeds the question, queries the vector DB, retrieves K chunks, passes them all to the LLM. Token cost adds up.
Related terms
- Large language model — the generative component RAG augments.
- AI agent — agents commonly use RAG for knowledge retrieval as one of their tools.
- Natural Language Processing — the broader field RAG sits within.
- Model Context Protocol — MCP servers often implement RAG retrieval as an exposed resource.
FAQ
Is RAG the same as fine-tuning?
No. RAG adds documents to the LLM's context at query time; fine-tuning trains the LLM on additional data, baking new patterns into its weights. RAG is cheaper and faster to update; fine-tuning gives consistent tone and style.
Do I need RAG for my chatbot?
If your chatbot answers questions specific to your business (products, policies, accounts), yes. If it's purely marketing automation (welcome flows, abandoned-cart messages, lead capture), no — predefined flows handle these without RAG.
Can RAG hallucinate?
Yes — though less than a plain LLM. RAG reduces hallucination by grounding answers in retrieved documents, but the LLM can still misinterpret retrieved content, combine information incorrectly, or fall back to training data when retrieval fails. System prompts that explicitly say "answer only from the provided context, and say I don't know otherwise" help.
What's the difference between RAG and a knowledge base?
A knowledge base is the underlying content (your docs, FAQs, wikis); RAG is the technique that uses it to ground an LLM. Knowledge bases existed long before LLMs (helpdesk software, wikis, search engines all use them). RAG is one way to connect a knowledge base to a generative AI.
How much does RAG cost?
For a moderate-volume customer-support chatbot, RAG adds roughly $0.01-0.05 in LLM costs per query (embedding + retrieval + chunk-augmented generation), plus vector database hosting ($20-200/month for SMB scale). At very high volume or with frequent re-indexing, costs grow.
Sources
- Lewis, Patrick et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. arxiv.org/abs/2005.11401.
- Anthropic. Contextual retrieval. anthropic.com/news/contextual-retrieval (verified 26 May 2026).
- OpenAI. Embeddings and retrieval guide. platform.openai.com/docs/guides/embeddings (verified 26 May 2026).
- LangChain documentation. Retrieval-augmented generation patterns. python.langchain.com/docs/use_cases/question_answering (verified 26 May 2026).