The most important production pattern: ground the model in your data so it stops guessing.
LLMs have a frozen knowledge cutoff and no access to your private docs. Asking about your internal policy → confident hallucination.
┌──────── Indexing (offline) ────────┐
Documents → chunk → embed → store in vector DB
└────────────────────────────────────┘
┌──────── Query (online) ────────────┐
User question → embed → similarity search → top-k chunks
→ stuff chunks into prompt as context → LLM → answer
└────────────────────────────────────┘
1Answer the question using ONLY the context below.
2If the answer is not in the context, say "I don't know based on the provided documents."
3Cite the source id for every claim.
4
5<context>
6[1] {chunk_1}
7[2] {chunk_2}
8[3] {chunk_3}
9</context>
10
11Question: {user_question}Every line here is a prompt-engineering decision:
| Failure | Cause | Fix |
|---|---|---|
| Hallucinated answer | Weak grounding instruction | "Use ONLY context; else say you don't know" |
| Ignores retrieved context | Context buried in middle | Put context near the end; keep it tight |
| Wrong chunk retrieved | Poor chunking/embedding | Chunk by semantic section; overlap; better query |
| No traceability | No citation requirement | Require [source_id] per claim |
Core principle: RAG turns "what does the model remember?" into "what can the model read?" — and the prompt is what enforces that discipline.