# Evaluating Prompt Quality

"It looked good when I tried it" is not evaluation. Production prompting requires measurement.

## Why Eyeballing Fails

LLMs are stochastic and prompts have huge surface area. A change that fixes one case can silently break ten others. You need a repeatable, automated way to know if a prompt change is better or worse.

## Build an Eval Set

An eval set is a dataset of representative inputs, each paired with an expected property:

```json
[
  { "input": "Battery dies fast", "expect": { "sentiment": "negative" } },
  { "input": "Best phone ever!!",  "expect": { "sentiment": "positive" } },
  { "input": "It's fine I guess",  "expect": { "sentiment": "neutral" } }
]
```

Cover: the common case, edge cases, known past failures (a "regression" suite), and adversarial inputs.
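As a minimal sketch of how such a set might be run, the snippet below loads the cases and scores a classifier by exact match. The `tag` field, the `run_eval` helper, and `fake_classify` (a stand-in for a real model call) are all assumptions made for illustration; they are not part of the original dataset format.

```python
import json
from typing import Callable

# Same cases as above, with a hypothetical "tag" field added so that
# failures can be grouped by coverage category (common, edge, ...).
EVAL_SET_JSON = """
[
  {"input": "Battery dies fast", "tag": "common", "expect": {"sentiment": "negative"}},
  {"input": "Best phone ever!!", "tag": "common", "expect": {"sentiment": "positive"}},
  {"input": "It's fine I guess", "tag": "edge", "expect": {"sentiment": "neutral"}}
]
"""


def run_eval(cases: list[dict], predict: Callable[[str], dict]) -> dict:
    """Run every case through `predict` and collect the mismatches."""
    failures = []
    for case in cases:
        got = predict(case["input"])
        if got != case["expect"]:
            failures.append({
                "input": case["input"],
                "tag": case["tag"],
                "expected": case["expect"],
                "got": got,
            })
    total = len(cases)
    passed = total - len(failures)
    return {
        "total": total,
        "passed": passed,
        "accuracy": passed / total if total else 0.0,
        "failures": failures,
    }


def fake_classify(text: str) -> dict:
    """Hypothetical stand-in for a model call; it never answers 'neutral'."""
    label = "positive" if "best" in text.lower() else "negative"
    return {"sentiment": label}


report = run_eval(json.loads(EVAL_SET_JSON), fake_classify)
```

Here `report["passed"]` is 2 out of 3, and the single failure carries the `edge` tag, which is the kind of signal that tells you which part of the suite a prompt change has broken.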

## Scoring Methods

| Method | Good for | Notes |
| --- | --- | --- |
| Exact / schema match | Classification, extraction, JSON | Cheap, objective |
| Heuristics / regex | "Contains a citation", "≤ 50 words" | Fast checks of constraints |
| Reference similarity | Summaries, paraphrase | Embeddings / ROUGE-style |
| LLM-as-judge | Open-ended quality, tone | Use a rubric; powerful but needs its own validation |
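The first three rows can be sketched with the standard library alone. The function names below are illustrative, the bracketed-number citation pattern is an assumed citation style, and `unigram_f1` is a simple word-overlap score in the spirit of ROUGE-1 rather than a full ROUGE implementation.

```python
import re
from collections import Counter


def exact_match(predicted: dict, expected: dict) -> bool:
    """True when every expected key is present with exactly the expected value."""
    return all(k in predicted and predicted[k] == v for k, v in expected.items())


def within_word_limit(text: str, limit: int = 50) -> bool:
    """Heuristic check: whitespace-separated word count is at most `limit`."""
    return len(text.split()) <= limit


# Assumed citation style: a bracketed number such as [1] or [12].
CITATION = re.compile(r"\[\d+\]")


def contains_citation(text: str) -> bool:
    """Heuristic check: the text contains at least one [n]-style citation."""
    return CITATION.search(text) is not None


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 over lowercased word overlap (no stemming)."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Word overlap rewards shared vocabulary, not shared meaning, so for paraphrase-heavy tasks an embedding-based similarity may track quality more closely.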

## LLM-as-Judge Pattern

```text
You are a strict grader. Given the QUESTION, the REFERENCE,
and the ANSWER, score the ANSWER 1-5 for factual accuracy
using ONLY the reference. Output JSON: {"score": n, "reason": "..."}
```

Note the irony: even your evaluator is a prompt-engineered LLM call — apply every principle from this course to it.
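One way to wire the judge into code is sketched below. The `QUESTION`/`REFERENCE`/`ANSWER` slots appended to the prompt, the `judge` and `parse_verdict` helpers, and the `call_model` parameter (any function that takes a prompt string and returns the model's text) are assumptions for illustration, not a specific provider's API. The point is that the judge's output gets validated like any other model output.

```python
import json
from typing import Callable

# Grader prompt from above; the three labelled slots are an assumed layout.
JUDGE_TEMPLATE = """You are a strict grader. Given the QUESTION, the REFERENCE,
and the ANSWER, score the ANSWER 1-5 for factual accuracy
using ONLY the reference. Output JSON: {{"score": n, "reason": "..."}}

QUESTION: {question}
REFERENCE: {reference}
ANSWER: {answer}"""


def parse_verdict(raw: str) -> dict:
    """Validate the judge's reply; raise ValueError on anything malformed."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return JSON: {raw!r}") from exc
    if not isinstance(verdict, dict):
        raise ValueError(f"judge returned a non-object: {raw!r}")
    score = verdict.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return {"score": score, "reason": str(verdict.get("reason", ""))}


def judge(question: str, reference: str, answer: str,
          call_model: Callable[[str], str]) -> dict:
    """Build the grader prompt, call the (hypothetical) model, parse the verdict."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, reference=reference, answer=answer
    )
    return parse_verdict(call_model(prompt))
```

Passing the model call in as a parameter keeps the grader testable with a stub, which helps when validating the judge against a handful of human-scored examples.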

## The Iteration Loop

change prompt → run eval set → compare metrics to baseline
   → better? keep + set new baseline
   → worse?  revert + analyse failures

For each prompt version, track accuracy, format-validity rate, cost, and latency. Never ship a prompt change without running the suite.
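The keep-or-revert decision can be made mechanical. The sketch below assumes one particular policy: quality metrics must not drop at all, while cost and latency may grow by up to 10%. The `Metrics` fields, the `compare` helper, and both thresholds are illustrative choices, not prescribed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    accuracy: float          # fraction of eval cases passed, 0..1
    format_validity: float   # fraction of outputs that parsed, 0..1
    cost_per_request: float  # in whatever unit you track
    p95_latency_s: float     # seconds


def compare(candidate: Metrics, baseline: Metrics,
            max_cost_ratio: float = 1.10,
            max_latency_ratio: float = 1.10) -> tuple[str, list[str]]:
    """Return ("keep", []) or ("revert", [names of regressed metrics])."""
    regressed = []
    if candidate.accuracy < baseline.accuracy:
        regressed.append("accuracy")
    if candidate.format_validity < baseline.format_validity:
        regressed.append("format_validity")
    if candidate.cost_per_request > baseline.cost_per_request * max_cost_ratio:
        regressed.append("cost_per_request")
    if candidate.p95_latency_s > baseline.p95_latency_s * max_latency_ratio:
        regressed.append("p95_latency_s")
    return ("revert", regressed) if regressed else ("keep", [])


baseline = Metrics(0.80, 0.95, 0.010, 1.2)
candidate = Metrics(0.85, 0.97, 0.0105, 1.25)
decision, regressed = compare(candidate, baseline)
if decision == "keep":
    baseline = candidate  # the candidate becomes the new baseline
```

On a revert, the list of regressed metric names gives the failure analysis a starting point.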

## Metrics That Matter

- Task accuracy (the obvious one)
- Format validity % (does code-consumable output parse?)
- Refusal/escalation correctness (does it say "I don't know" when it should?)
- Cost & p95 latency per request
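The less obvious metrics in this list are each a few lines of code. This sketch assumes that outputs are meant to be JSON, that each run record carries hypothetical `refused` and `should_refuse` booleans, and that p95 is taken by the nearest-rank method; other percentile definitions give slightly different numbers on small samples.

```python
import json


def format_validity(outputs: list[str]) -> float:
    """Fraction of raw outputs that parse as JSON."""
    if not outputs:
        raise ValueError("no outputs to score")

    def parses(raw: str) -> bool:
        try:
            json.loads(raw)
            return True
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            return False

    return sum(parses(o) for o in outputs) / len(outputs)


def refusal_correctness(records: list[dict]) -> float:
    """Fraction of cases where the model refused exactly when it should have."""
    if not records:
        raise ValueError("no records to score")
    return sum(r["refused"] == r["should_refuse"] for r in records) / len(records)


def p95(latencies: list[float]) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    if not latencies:
        raise ValueError("no latencies to score")
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil of 0.95 * n
    return ordered[rank - 1]
```

Refusal correctness counts both directions as errors: refusing an answerable question hurts the score just as much as answering one that should have been declined.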

Principle: A prompt without an eval set is a guess. Treat prompt changes like code changes — tested, measured, reversible.