#Cost, Latency & Optimization¶

A correct prompt that is slow and expensive doesn't survive production. Engineer the economics too.

Where Cost & Latency Come From¶

You pay (and wait) per token: input tokens + output tokens. Latency is dominated by output token count (tokens are generated sequentially) and model size.

High-Leverage Optimizations¶

1. Right-size the model¶

Use the smallest model that passes your eval set. Route easy requests (classification) to a cheap model, hard ones (complex reasoning) to a strong one — a cheap router prompt decides (Module 4).

2. Trim the prompt¶

Remove redundant instructions and verbose examples
Compress/curate RAG context — fewer, better chunks beat many mediocre ones
Drop politeness padding; the model doesn't need "please" to comply

3. Cap and shape output¶

Set max_tokens sensibly — shorter answers are cheaper and faster
Ask for terse formats ("3 bullets, ≤10 words each") when verbosity adds no value
Reserve chain-of-thought for problems that need it (it multiplies output tokens)

4. Prompt caching¶

Many providers cache stable prompt prefixes (system prompt, few-shot block, large context). Put the static content first, dynamic content last so the prefix is reused — large cost/latency savings on repeated calls.

5. Cache & batch at the app layer¶

Cache identical/equivalent requests (semantic cache)
Batch offline workloads instead of one call per item
Stream responses to improve perceived latency even when total time is unchanged

The Trade-off Triangle¶

        Quality
         /    \
        /      \
   Cost ───── Latency

You're always balancing three. Make the trade-off deliberately and measured (Module 5 evals tell you whether a cheaper/faster config still passes).

Optimization Checklist¶

Smallest model that passes evals
Static prefix first (cache-friendly), dynamic input last
Prompt trimmed; RAG context curated
max_tokens capped; output format terse
CoT only where it pays for itself
App-level cache + streaming for UX

Final principle of the course: A production prompt is judged on four axes together — accuracy, reliability, cost, and latency. Engineering only the first is a prototype, not a product.