The four levers that change model behaviour without changing your prompt text.
Models don't see characters or words; they see tokens (sub-word chunks). Roughly:
"prompt engineering" ≈ 2–3 tokens
You are billed per token (input + output), and limits are in tokens. Estimating token count is a daily skill.
The context window is the maximum number of tokens (prompt + response) the model can attend to at once — e.g. 8K, 128K, 1M depending on the model.
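
A minimal sketch of estimating tokens and checking a request against the context window. It assumes roughly 4 characters per token, a common rule of thumb for English text; real counts depend on the model's tokenizer, so use the provider's tokenizer when accuracy matters. The function names and the prices passed in are hypothetical.

```python
import math

# Assumption: ~4 characters per token for English text. This is a heuristic,
# not a tokenizer; actual counts vary by model and language.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate derived from character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one call; input and output tokens are priced separately."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


def fits_in_context(prompt_tokens: int, max_response_tokens: int,
                    context_window: int) -> bool:
    """Prompt and response share the same window, so budget for both."""
    return prompt_tokens + max_response_tokens <= context_window


prompt = "Summarise the following support ticket in two sentences."
prompt_tokens = estimate_tokens(prompt)
print(prompt_tokens, fits_in_context(prompt_tokens, 500, 8000))
```

The check adds the reserved response length to the prompt estimate because both count against the same window.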
Consequences:
Temperature controls the randomness of sampling:
| Temperature | Behaviour | Use for |
|---|---|---|
| 0 | Deterministic, picks most likely token | Extraction, classification, code |
| 0.7 | Balanced creativity | Chat, drafting |
| 1.0+ | Highly varied, riskier | Brainstorming, ideation |
For anything where you'd write a unit test, use temperature 0.
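
To make the effect of temperature concrete, here is a toy sampler over raw scores (logits). It is an illustrative sketch, not any provider's implementation: logits are divided by the temperature before the softmax, and temperature 0 is treated as a plain argmax. The function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    """Return the index of the chosen token."""
    if temperature == 0:
        # Greedy decoding: always pick the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [1.0, 3.0, 2.0]
print(sample_token(logits, 0))                # always index 1
print(softmax_with_temperature(logits, 0.5))  # sharper than at 1.0
print(softmax_with_temperature(logits, 1.0))
```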
Instead of capping randomness directly, top_p restricts sampling to the smallest set of tokens whose cumulative probability ≥ p. top_p = 0.1 ≈ very focused. Usually tune temperature or top_p, not both.
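
The top_p rule above can be sketched as a filter over a token-probability table. This is a simplified illustration with a made-up distribution and a hypothetical function name; it keeps the most likely tokens until their cumulative probability reaches p, then renormalises.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalise so the surviving probabilities sum to 1 before sampling.
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Made-up next-token distribution, for illustration only.
next_token_probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
print(top_p_filter(next_token_probs, 0.1))  # only "the" survives
print(top_p_filter(next_token_probs, 0.9))  # "the", "a", "an" survive
```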
max_tokens caps the response length (and cost). Too low → truncated JSON.
Stop sequences end generation when the model emits a given string (e.g. "\n\n"), useful for structured output.
Deterministic task? → temperature 0
Need variety? → temperature 0.7–1.0
Output cut off? → raise max_tokens
Big document? → check context window → chunk/RAG
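
The last two rows of the checklist can be sketched in code. This is a rough illustration with hypothetical helper names: one function flags a JSON response that fails to parse (a common symptom of max_tokens being too low), the other splits a large document on blank lines so each chunk stays under a token budget, again assuming about 4 characters per token.

```python
import json


def looks_truncated(raw_json: str) -> bool:
    """True if the response is not valid JSON.

    Any parse failure counts, so this cannot tell truncation apart from
    other malformed output; treat it as a hint to raise max_tokens.
    """
    try:
        json.loads(raw_json)
        return False
    except json.JSONDecodeError:
        return True


def chunk_by_token_budget(text, max_tokens_per_chunk, chars_per_token=4):
    """Group paragraphs into chunks that fit an estimated token budget.

    A single paragraph larger than the budget is kept whole, not split.
    """
    max_chars = max_tokens_per_chunk * chars_per_token
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


print(looks_truncated('{"status": "ok", "items": [1, 2'))
print(chunk_by_token_budget("aaaa\n\nbbbb\n\ncccc", max_tokens_per_chunk=3))
```

Splitting on paragraph boundaries keeps each chunk readable on its own; a retrieval (RAG) setup would index such chunks and send only the relevant ones.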