A correct prompt that is slow and expensive doesn't survive production. Engineer the economics too.
You pay (and wait) per token: input tokens + output tokens. Latency is dominated by output token count (tokens are generated sequentially) and model size.
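The per-token arithmetic above can be sketched in a few lines. This is a rough estimate under stated assumptions, not a billing formula from any provider: the `Pricing` fields, the time-to-first-token figure, and the tokens-per-second rate are hypothetical inputs you would replace with your provider's published prices and your own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices, in dollars per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Dollar cost of one call: input and output tokens are billed separately."""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


def estimated_latency_s(output_tokens: int,
                        time_to_first_token_s: float,
                        tokens_per_second: float) -> float:
    """Rough latency model: a fixed wait for the first token, then
    sequential generation of the remaining output."""
    return time_to_first_token_s + output_tokens / tokens_per_second
```

Halving the output length in this model roughly halves the generation part of the latency, which is why output token count is the first thing to trim.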
Use the smallest model that passes your eval set. Route easy requests (classification) to a cheap model, hard ones (complex reasoning) to a strong one — a cheap router prompt decides (Module 4).
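A minimal sketch of that routing idea follows. The model names are placeholders, and `keyword_router` is a hypothetical stand-in for the cheap router prompt: in practice `classify_difficulty` would call a small model and return its label.

```python
from typing import Callable

# Placeholder model identifiers (assumptions, not real model names).
CHEAP_MODEL = "small-model"
STRONG_MODEL = "large-model"


def route(request: str, classify_difficulty: Callable[[str], str]) -> str:
    """Return the model to use for this request.

    Anything the router does not clearly label "easy" goes to the strong
    model, so a confused router costs money rather than quality.
    """
    label = classify_difficulty(request).strip().lower()
    return CHEAP_MODEL if label == "easy" else STRONG_MODEL


def keyword_router(request: str) -> str:
    """Stand-in for a router prompt: flags requests that look like reasoning."""
    hard_markers = ("prove", "plan", "why", "step by step")
    text = request.lower()
    return "hard" if any(marker in text for marker in hard_markers) else "easy"
```

Defaulting to the strong model on an unexpected label is a design choice: it trades some cost for safety, and your eval set can tell you whether the opposite default is acceptable.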
Set max_tokens sensibly: shorter answers are cheaper and faster.
Many providers cache stable prompt prefixes (system prompt, few-shot block, large context). Put the static content first and the dynamic content last so the prefix is reused; on repeated calls this can save a large share of cost and latency.
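The ordering rule can be illustrated with a small prompt builder. This is only a sketch: the prompt text and the `build_prompt` helper are made up for illustration, and real caches match on tokens under provider-specific rules (minimum prefix length, expiry, explicit cache markers), so check your provider's documentation before relying on it.

```python
import os.path

# Static content: identical on every call, so it belongs at the front.
SYSTEM_PROMPT = "You are a support assistant. Answer in one sentence."
FEW_SHOT_BLOCK = (
    "Q: Where is my order?\n"
    "A: Check the tracking link in your confirmation email."
)


def build_prompt(user_question: str) -> str:
    """Static prefix first, per-request content last."""
    static_prefix = f"{SYSTEM_PROMPT}\n\n{FEW_SHOT_BLOCK}\n\n"
    return static_prefix + f"Q: {user_question}\nA:"


def shared_prefix(prompt_a: str, prompt_b: str) -> str:
    """Longest common leading substring of two prompts (character-wise)."""
    return os.path.commonprefix([prompt_a, prompt_b])
```

Two calls with different questions share the whole system prompt and few-shot block as a common prefix. If a timestamp or user ID were placed at the top instead, the shared prefix would shrink to almost nothing and the cache could not help.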
    Quality
      / \
     /   \
Cost ───── Latency
You're always balancing all three. Make the trade-off deliberately and measure it (Module 5 evals tell you whether a cheaper or faster config still passes).
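One way to make that trade-off explicit is to treat quality and latency as constraints and cost as the thing to minimize. The sketch below assumes you already have eval results per configuration; the `ConfigResult` fields and the thresholds are hypothetical and would come from your own eval harness and product requirements.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ConfigResult:
    """Measured results for one model/prompt configuration (fields are assumptions)."""
    name: str
    pass_rate: float       # fraction of eval cases passed, 0.0 to 1.0
    cost_per_call: float   # dollars
    p95_latency_s: float   # seconds


def pick_config(results: Sequence[ConfigResult],
                min_pass_rate: float,
                max_latency_s: float) -> Optional[ConfigResult]:
    """Cheapest configuration that meets the quality bar and the latency budget.

    Returns None when nothing qualifies, which signals that a requirement
    has to be relaxed deliberately rather than silently.
    """
    eligible = [
        r for r in results
        if r.pass_rate >= min_pass_rate and r.p95_latency_s <= max_latency_s
    ]
    return min(eligible, key=lambda r: r.cost_per_call) if eligible else None
```

Other framings are equally valid, for example fixing a cost budget and maximizing pass rate; the point is that the choice is written down and checked against measurements.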
max_tokens capped; output format terse.
Final principle of the course: A production prompt is judged on four axes together: accuracy, reliability, cost, and latency. Engineering only the first gives you a prototype, not a product.