# Retrieval-Augmented Generation (RAG)

The most important production pattern: ground the model in your data so it stops guessing.

## The Problem RAG Solves

LLMs have a frozen knowledge cutoff and no access to your private docs. Ask one about your internal policy and you get a confident hallucination.

## The RAG Pipeline

    ┌──────── Indexing (offline) ────────────────────────────┐
      Documents → chunk → embed → store in vector DB
    └────────────────────────────────────────────────────────┘

    ┌──────── Query (online) ────────────────────────────────┐
      User question → embed → similarity search → top-k chunks
        → stuff chunks into prompt as context → LLM → answer
    └────────────────────────────────────────────────────────┘
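
The two phases above can be sketched end to end in a few lines. This is a minimal illustration, not a production setup: `embed` is a toy word-count stand-in for a real embedding model, `InMemoryIndex` is a hypothetical stand-in for a vector DB, and the sample documents are invented placeholders.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Hypothetical stand-in for a vector DB: a list of (chunk, vector) rows."""

    def __init__(self) -> None:
        self.rows = []

    def add(self, chunks) -> None:
        # Indexing: embed each chunk once and store it.
        for chunk in chunks:
            self.rows.append((chunk, embed(chunk)))

    def search(self, question: str, k: int = 3):
        # Query: embed the question, rank stored chunks by similarity, keep top-k.
        q = embed(question)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


# Indexing (offline). The documents are invented examples.
index = InMemoryIndex()
index.add([
    "Employees accrue 20 days of paid vacation per year.",
    "Expense reports must be filed within 30 days.",
    "The office is closed on public holidays.",
])

# Query (online). The returned chunks are what gets stuffed into the prompt.
top_chunks = index.search("How many vacation days do employees get?", k=2)
print(top_chunks)
```

In a real system only the two stand-ins would change: `embed` would call an embedding model and the index would be a vector store. The control flow stays the same.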

## The RAG Prompt Template

```text
Answer the question using ONLY the context below.
If the answer is not in the context, say "I don't know based on the provided documents."
Cite the source id for every claim.

<context>
[1] {chunk_1}
[2] {chunk_2}
[3] {chunk_3}
</context>

Question: {user_question}
```
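
Filling the template is plain string formatting. The sketch below assumes the retrieved chunks arrive as a list of strings and numbers them from 1 so the model can cite them as `[n]`. The name `build_rag_prompt` and the sample chunks are illustrative, not part of any library.

```python
PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the answer is not in the context, say "I don't know based on the provided documents."
Cite the source id for every claim.

<context>
{context}
</context>

Question: {user_question}"""


def build_rag_prompt(chunks: list[str], user_question: str) -> str:
    """Number each chunk from 1 and place it inside the delimited context block."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(context=context, user_question=user_question)


# Invented example chunks; in practice these come from the similarity search.
prompt = build_rag_prompt(
    [
        "Employees accrue 20 days of paid vacation per year.",
        "Expense reports must be filed within 30 days.",
    ],
    "How many vacation days do employees get?",
)
print(prompt)
```

Generating the ids in code, rather than hard-coding three slots, lets the same template serve any top-k.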

Every line here is a prompt-engineering decision:

- "ONLY the context" → reduces hallucination
- "say I don't know" → explicit fallback (Module 2 principle)
- "cite the source id" → makes answers verifiable and trustworthy
- delimited context → separates data from instruction (Module 2)

## What Makes RAG Fail (and the Prompt Fixes)

| Failure | Cause | Fix |
| --- | --- | --- |
| Hallucinated answer | Weak grounding instruction | "Use ONLY context; else say you don't know" |
| Ignores retrieved context | Context buried in middle | Put context near the end; keep it tight |
| Wrong chunk retrieved | Poor chunking/embedding | Chunk by semantic section; overlap; better query |
| No traceability | No citation requirement | Require `[source_id]` per claim |
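
A citation requirement is only useful if something checks it. The following is a rough post-processing sketch. It assumes citations look like `[1]` and sit before the sentence-ending punctuation, and the sentence splitter is deliberately naive. `check_citations` is a hypothetical helper, not part of any library.

```python
import re

FALLBACK = "I don't know based on the provided documents."


def check_citations(answer: str, num_chunks: int) -> dict:
    """Report cited ids that were never provided and sentences with no citation."""
    if answer.strip() == FALLBACK:
        # The explicit fallback makes no claims, so it needs no citation.
        return {"unknown_ids": [], "uncited_sentences": []}

    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    valid = set(range(1, num_chunks + 1))

    # Naive split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not re.search(r"\[\d+\]", s)]

    return {"unknown_ids": sorted(cited - valid), "uncited_sentences": uncited}


report = check_citations(
    "Employees get 20 vacation days [1]. Reports are due within 30 days [4]. Lunch is free.",
    num_chunks=3,
)
print(report)
```

An id outside the provided range or an uncited sentence shows where the model may have left the context. It does not prove that the cited chunk actually supports the claim.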

## Chunking Tips

- Chunk by meaning (sections/paragraphs), not by an arbitrary fixed length
- Add overlap (e.g. 10–15%) so ideas aren't split across chunk boundaries
- Keep chunks small enough that several fit in the context window with room for the answer
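
One way to combine these tips is to split on blank lines, pack whole paragraphs into chunks up to a size budget, and carry the tail of each finished chunk into the next. The function below is a sketch under assumptions: sizes are measured in characters rather than tokens, `max_chars` and `overlap_ratio` are illustrative parameters, and a paragraph longer than `max_chars` is kept whole rather than split.

```python
import re


def chunk_by_paragraph(
    text: str, max_chars: int = 800, overlap_ratio: float = 0.15
) -> list[str]:
    """Pack whole paragraphs into chunks, overlapping each chunk with the previous one."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    overlap_chars = round(max_chars * overlap_ratio)

    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            # Carry the tail of the finished chunk into the next one.
            tail = current[-overlap_chars:] if overlap_chars else ""
            # Drop the first (possibly partial) word so the overlap starts cleanly.
            if " " in tail:
                tail = tail.split(" ", 1)[1]
            current = f"{tail}\n\n{para}" if tail else para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Invented sample text with three short paragraphs.
sample = (
    "Alpha policy covers vacation days.\n\n"
    "Beta policy covers expenses and travel.\n\n"
    "Gamma policy covers holidays."
)
chunks = chunk_by_paragraph(sample, max_chars=80, overlap_ratio=0.15)
for c in chunks:
    print(repr(c))
```

Because the overlap is prepended after the size check, a chunk can exceed `max_chars` by up to the overlap length. Budget for that when deciding how many chunks fit in the prompt.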

Core principle: RAG turns "what does the model remember?" into "what can the model read?" — and the prompt is what enforces that discipline.
