"It looked good when I tried it" is not evaluation. Production prompting requires measurement.
LLMs are stochastic and prompts have huge surface area. A change that fixes one case can silently break ten others. You need a repeatable, automated way to know if a prompt change is better or worse.
An eval set is a dataset of representative inputs, each paired with an expected property:

    [
      { "input": "Battery dies fast", "expect": { "sentiment": "negative" } },
      { "input": "Best phone ever!!", "expect": { "sentiment": "positive" } },
      { "input": "It's fine I guess", "expect": { "sentiment": "neutral" } }
    ]

Cover the common case, edge cases, known past failures (a "regression" suite), and adversarial inputs.
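
To make the dataset idea concrete, here is a minimal sketch of a runner that scores a predictor against those three cases by exact match. `stub_classify` and `run_eval` are hypothetical names introduced for illustration; the stub is a keyword rule standing in for whatever prompt and model call you actually use.

```python
import json

# The three cases from the dataset above, loaded the way a JSON file would be.
EVAL_SET = json.loads("""
[
  { "input": "Battery dies fast", "expect": { "sentiment": "negative" } },
  { "input": "Best phone ever!!", "expect": { "sentiment": "positive" } },
  { "input": "It's fine I guess", "expect": { "sentiment": "neutral" } }
]
""")


def stub_classify(text: str) -> dict:
    """Hypothetical stand-in for the real prompt + model call."""
    lowered = text.lower()
    if "best" in lowered:
        return {"sentiment": "positive"}
    if "dies" in lowered:
        return {"sentiment": "negative"}
    return {"sentiment": "neutral"}


def run_eval(cases, predict):
    """Return (accuracy, failures) using exact match on every expected key."""
    failures = []
    for case in cases:
        output = predict(case["input"])
        # A case passes only if every expected property matches exactly.
        if any(output.get(key) != value for key, value in case["expect"].items()):
            failures.append(
                {"input": case["input"], "expected": case["expect"], "got": output}
            )
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures


accuracy, failures = run_eval(EVAL_SET, stub_classify)
print(f"accuracy={accuracy:.2f}, failures={len(failures)}")
```

Returning the failing cases alongside the score matters as much as the score itself: they are what you read when a change makes things worse.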
| Method | Good for | Notes |
|---|---|---|
| Exact / schema match | Classification, extraction, JSON | Cheap, objective |
| Heuristics / regex | "Contains a citation", "≤ 50 words" | Fast checks of constraints |
| Reference similarity | Summaries, paraphrase | Embeddings / ROUGE-style |
| LLM-as-judge | Open-ended quality, tone | Use a rubric; powerful but needs its own validation |
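
The cheaper rows of the table can be sketched in a few lines. The checks below are illustrative only: the citation pattern assumes a bracketed-number style such as `[3]`, and `unigram_f1` is a crude word-overlap score in the spirit of ROUGE-1, not an implementation of any ROUGE library.

```python
import re
from collections import Counter

# Assumed citation style for this sketch: bracketed numbers like [1] or [12].
CITATION = re.compile(r"\[\d+\]")


def has_citation(text: str) -> bool:
    """Heuristic check: does the answer contain at least one citation marker?"""
    return CITATION.search(text) is not None


def within_word_limit(text: str, limit: int = 50) -> bool:
    """Heuristic check: is the answer at most `limit` whitespace-separated words?"""
    return len(text.split()) <= limit


def unigram_f1(candidate: str, reference: str) -> float:
    """Reference similarity: F1 over overlapping lowercase words."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Checks like these are objective and fast, so they can run on every case; the more expensive judge-style scoring can be reserved for the properties they cannot capture.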

    You are a strict grader. Given the QUESTION, the REFERENCE,
    and the ANSWER, score the ANSWER 1-5 for factual accuracy
    using ONLY the reference. Output JSON: {"score": n, "reason": "..."}

Note the irony: even your evaluator is a prompt-engineered LLM call, so apply every principle from this course to it.
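
Because the grader is itself a model call, its reply deserves the same format checks as any other output. The following is a minimal sketch of validating the JSON shape requested in the grader prompt; `parse_judge_output` is a hypothetical helper, and treating the score as an integer is an assumption (the prompt only says 1-5).

```python
import json


def parse_judge_output(raw: str) -> dict:
    """Validate the grader's JSON reply; raise ValueError if it is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")

    reason = data.get("reason")
    if not isinstance(reason, str):
        raise ValueError("reason must be a string")
    return {"score": score, "reason": reason}
```

Counting how often this raises gives a format-validity rate for the judge itself, which is one simple way to start validating it.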

    change prompt → run eval set → compare metrics to baseline
      → better? keep + set new baseline
      → worse? revert + analyse failures

Track accuracy, format-validity rate, and cost and latency for each prompt version. Never ship a prompt change without running the suite.
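
The loop and the tracked metrics can be sketched as a small comparison against the current baseline. Everything here is an assumption for illustration: the `Metrics` fields, the rule that "better" means strictly higher accuracy with no drop in format validity, and the choice to record cost and latency without gating on them. The numbers in the usage lines are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    accuracy: float           # fraction of eval cases passed
    format_valid_rate: float  # fraction of outputs that parsed correctly
    cost_usd: float           # average cost per call (recorded, not gated)
    latency_s: float          # average latency per call (recorded, not gated)


def decide(baseline: Metrics, candidate: Metrics) -> str:
    """Assumed rule: keep only if accuracy improves and format validity holds."""
    improved = candidate.accuracy > baseline.accuracy
    format_ok = candidate.format_valid_rate >= baseline.format_valid_rate
    return "keep" if improved and format_ok else "revert"


def update_baseline(history: dict, baseline_version: str,
                    version: str, candidate: Metrics) -> str:
    """Record the candidate's metrics and return the version to use as baseline."""
    history[version] = candidate  # keep every run, including reverted ones
    if decide(history[baseline_version], candidate) == "keep":
        return version
    return baseline_version


# Illustrative, made-up numbers for three prompt versions.
history = {"v1": Metrics(0.80, 0.97, 0.0020, 1.1)}
baseline = "v1"
baseline = update_baseline(history, baseline, "v2", Metrics(0.86, 0.98, 0.0021, 1.2))
baseline = update_baseline(history, baseline, "v3", Metrics(0.84, 0.99, 0.0019, 1.0))
print(baseline)
```

Keeping reverted versions in the history is deliberate: their failures are the raw material for the "analyse failures" step.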
Principle: A prompt without an eval set is a guess. Treat prompt changes like code changes — tested, measured, reversible.