Prompt Engineering Mastery: From Fundamentals to Production

Chapter 5 of 5 · Module 5 · Evaluation, Safety & Production
Lesson 21 of 25 · Reading · 12 min

Evaluating Prompt Quality

"It looked good when I tried it" is not evaluation. Production prompting requires measurement.

Why Eyeballing Fails

LLMs are stochastic, and a prompt has a huge surface area: any wording change can shift behavior on inputs you never looked at. A change that fixes one case can silently break ten others. You need a repeatable, automated way to know whether a prompt change is better or worse.

Build an Eval Set

An eval set is a dataset of representative inputs, each paired with an expected property:

```json
[
  { "input": "Battery dies fast", "expect": { "sentiment": "negative" } },
  { "input": "Best phone ever!!",  "expect": { "sentiment": "positive" } },
  { "input": "It's fine I guess",  "expect": { "sentiment": "neutral" } }
]
```

Cover: the common case, edge cases, known past failures (a "regression" suite), and adversarial inputs.
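As a rough illustration, an eval set in the shape of the JSON example above could be run like this. This is a minimal sketch, not a definitive harness: `classify_stub` is a hypothetical stand-in for your real prompt and model call (it is deliberately wrong on the neutral case so a failure shows up), and `run_eval` is an illustrative helper name.

```python
import json

# Eval set in the same shape as the JSON example above.
EVAL_SET = json.loads("""
[
  {"input": "Battery dies fast", "expect": {"sentiment": "negative"}},
  {"input": "Best phone ever!!", "expect": {"sentiment": "positive"}},
  {"input": "It's fine I guess", "expect": {"sentiment": "neutral"}}
]
""")


def classify_stub(text: str) -> dict:
    """Hypothetical stand-in for the prompt + model call under test."""
    lowered = text.lower()
    if "best" in lowered:
        return {"sentiment": "positive"}
    if "dies" in lowered:
        return {"sentiment": "negative"}
    return {"sentiment": "positive"}  # deliberately wrong on the neutral case


def run_eval(cases, predict):
    """Return (accuracy, failures); a case passes if every expected key matches."""
    failures = []
    for case in cases:
        output = predict(case["input"])
        expected = case["expect"]
        if any(output.get(key) != value for key, value in expected.items()):
            failures.append({"input": case["input"], "expected": expected, "got": output})
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures


accuracy, failures = run_eval(EVAL_SET, classify_stub)
print(f"accuracy={accuracy:.2f}")
for failure in failures:
    print(failure)
```

Keeping the failing cases, not just the score, is what makes the run useful: each failure is a candidate for the regression suite.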

Scoring Methods

| Method | Good for | Notes |
| --- | --- | --- |
| Exact / schema match | Classification, extraction, JSON | Cheap, objective |
| Heuristics / regex | "Contains a citation", "≤ 50 words" | Fast checks of constraints |
| Reference similarity | Summaries, paraphrase | Embeddings / ROUGE-style |
| LLM-as-judge | Open-ended quality, tone | Use a rubric; powerful but needs its own validation |
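The first three methods in the table above can be sketched with the standard library alone. The assumptions here are loud ones: the citation pattern assumes bracketed numbers such as `[1]`, the schema check only verifies key presence and Python types rather than a full JSON Schema, and `unigram_f1` is a simplified ROUGE-1-style word overlap, not the official ROUGE implementation. All function names are illustrative.

```python
import json
import re
from collections import Counter


def schema_match(raw: str, required: dict) -> bool:
    """True if raw parses as a JSON object with each required key of the given type."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


def within_word_limit(text: str, limit: int = 50) -> bool:
    """Heuristic check for a length constraint such as "≤ 50 words"."""
    return len(text.split()) <= limit


def contains_citation(text: str) -> bool:
    """Assumes citations look like [1] or [12]; adjust the pattern to your format."""
    return re.search(r"\[\d+\]", text) is not None


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style overlap: F1 over lowercase word counts."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(schema_match('{"sentiment": "negative"}', {"sentiment": str}))
print(contains_citation("See the manual [2]."))
print(round(unigram_f1("the battery dies fast", "battery dies fast"), 3))
```

Cheap checks like these are worth running first; reserve the more expensive methods for the properties they cannot express.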

LLM-as-Judge Pattern

```text
You are a strict grader. Given the QUESTION, the REFERENCE,
and the ANSWER, score the ANSWER 1-5 for factual accuracy
using ONLY the reference. Output JSON: {"score": n, "reason": "..."}
```

Note the irony: even your evaluator is a prompt-engineered LLM call — apply every principle from this course to it.
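One way to wire the grader prompt above into code is sketched below. The labelled `QUESTION:` / `REFERENCE:` / `ANSWER:` layout appended to the prompt is an assumption, and `fake_judge` is a hypothetical stand-in for a real model call. The part worth copying is the defensive parsing: a judge is itself an LLM, so its output should be validated rather than trusted.

```python
import json

# Grader prompt from above; the labelled fields at the end are an assumed layout.
JUDGE_TEMPLATE = """You are a strict grader. Given the QUESTION, the REFERENCE,
and the ANSWER, score the ANSWER 1-5 for factual accuracy
using ONLY the reference. Output JSON: {{"score": n, "reason": "..."}}

QUESTION: {question}
REFERENCE: {reference}
ANSWER: {answer}"""


def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    return JUDGE_TEMPLATE.format(question=question, reference=reference, answer=answer)


def parse_judgement(raw: str):
    """Return (score, reason), or None if the judge output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        return None
    return score, str(data.get("reason", ""))


def fake_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always returns the same verdict."""
    return '{"score": 4, "reason": "Matches the reference but omits the date."}'


prompt = build_judge_prompt(
    question="When was the device released?",
    reference="The device was released in March.",
    answer="It came out in the spring.",
)
print(parse_judgement(fake_judge(prompt)))
```

Counting how often `parse_judgement` returns `None` gives a format-validity rate for the judge itself, which is one simple way to start validating it.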

The Iteration Loop

change prompt → run eval set → compare metrics to baseline, then:

  • better? keep + set new baseline
  • worse? revert + analyse failures

Track accuracy, format-validity rate, cost/latency per version. Never ship a prompt change without running the suite.
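The keep-or-revert decision in the loop above can be made explicit in code. This is a sketch under a stated assumption: the policy here (accuracy must improve, and format validity may not drop by more than a small tolerance) is only one reasonable choice, and the metric values are made-up illustrations. Cost is recorded per version but left out of the decision to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    accuracy: float          # fraction of eval cases passed
    format_validity: float   # fraction of outputs that parsed
    cost_per_request: float  # in whatever unit you track


def decide(candidate: Metrics, baseline: Metrics, tolerance: float = 0.01) -> str:
    """Return "keep" if accuracy improved without a format regression, else "revert"."""
    if candidate.format_validity < baseline.format_validity - tolerance:
        return "revert"
    if candidate.accuracy > baseline.accuracy:
        return "keep"
    return "revert"


# Illustrative numbers only.
baseline = Metrics(accuracy=0.82, format_validity=0.99, cost_per_request=0.004)
candidate = Metrics(accuracy=0.86, format_validity=0.99, cost_per_request=0.005)

verdict = decide(candidate, baseline)
if verdict == "keep":
    baseline = candidate  # the candidate becomes the new baseline
print(verdict, baseline)
```

Writing the policy down as a function forces the team to agree on what "better" means before the numbers come in.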

Metrics That Matter

  • Task accuracy (the obvious one)
  • Format validity % (does code-consumable output parse?)
  • Refusal/escalation correctness (does it say "I don't know" when it should?)
  • Cost & p95 latency per request
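Three of the metrics listed above reduce to a few lines of arithmetic. The sketch below assumes "format validity" means the raw output parses as JSON, uses the nearest-rank definition of the 95th percentile (other definitions interpolate and give slightly different values), and takes refusal labels as parallel lists of booleans. All names and sample values are illustrative.

```python
import json
import math


def format_validity(outputs: list[str]) -> float:
    """Percentage of raw outputs that parse as JSON."""
    ok = 0
    for raw in outputs:
        try:
            json.loads(raw)
            ok += 1
        except json.JSONDecodeError:
            pass
    return 100 * ok / len(outputs)


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of the samples."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def refusal_correctness(should_refuse: list[bool], did_refuse: list[bool]) -> float:
    """Fraction of cases where the model refused exactly when it should have."""
    matches = sum(s == d for s, d in zip(should_refuse, did_refuse))
    return matches / len(should_refuse)


print(format_validity(['{"a": 1}', "oops", "[]", '{"b": 2}']))
print(p95(list(range(1, 101))))
print(refusal_correctness([True, False, False, True], [True, False, True, True]))
```

Reporting p95 rather than the mean matters because a handful of slow requests can dominate user experience while barely moving the average.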

Principle: A prompt without an eval set is a guess. Treat prompt changes like code changes — tested, measured, reversible.
