A repetition penalty is a decoding-time control that reduces the probability of generating tokens (or n-grams) that have already appeared, helping prevent LLM outputs from looping or repeating phrases.
What is Repetition Penalty?
Autoregressive models can enter feedback loops in which recently generated tokens stay highly probable, so the model keeps emitting them. A repetition penalty adjusts the logits so that previously seen tokens are less likely to be chosen again. Related variants include presence penalties (a flat penalty on any token that has already appeared, regardless of how often), frequency penalties (a penalty that grows with each occurrence of a token), and n-gram blocking (a hard constraint that disallows any n-gram that has already been generated).
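The variants above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any particular library: logits are a plain list indexed by token id, the function names and default values are hypothetical, and the multiplicative rule (divide positive logits, multiply negative ones) is one common formulation among several.

```python
from collections import Counter
from typing import List, Sequence, Set


def apply_repetition_penalty(logits: List[float], prev_tokens: Sequence[int],
                             penalty: float = 1.2) -> List[float]:
    """Multiplicative penalty (assumed formulation): make every seen token less likely."""
    out = list(logits)
    for tok in set(prev_tokens):
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def apply_presence_frequency_penalty(logits: List[float], prev_tokens: Sequence[int],
                                     presence: float = 0.0,
                                     frequency: float = 0.0) -> List[float]:
    """Additive penalties: a flat cost once a token has appeared, plus a cost per occurrence."""
    out = list(logits)
    for tok, count in Counter(prev_tokens).items():
        out[tok] -= presence + frequency * count
    return out


def banned_next_tokens(prev_tokens: Sequence[int], n: int) -> Set[int]:
    """n-gram blocking: token ids that would complete an n-gram already generated."""
    if n < 1 or len(prev_tokens) < n - 1:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(prev_tokens[len(prev_tokens) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(prev_tokens) - n + 1):
        if tuple(prev_tokens[i:i + n - 1]) == prefix:
            banned.add(prev_tokens[i + n - 1])
    return banned


# Toy vocabulary of four token ids; tokens 0 and 1 have already been generated.
logits = [2.0, -1.0, 0.5, 3.0]
history = [0, 1, 0]
penalized = apply_repetition_penalty(logits, history, penalty=2.0)
additive = apply_presence_frequency_penalty(logits, history, presence=0.5, frequency=0.25)
blocked = banned_next_tokens([1, 2, 3, 1, 2], n=3)  # emitting 3 would repeat (1, 2, 3)
```

The first two functions only reshape the distribution, so a penalized token can still be sampled. A decoder using n-gram blocking would instead remove the banned ids entirely, for example by setting their logits to negative infinity before sampling.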
Where it’s used and why it matters
Repetition penalties are common in chat, summarization, and creative writing, where they improve readability and reduce user-visible failures such as looping sentences or repeated phrases. They can also cut token waste, since looped text consumes output length without adding information. Over-penalizing can suppress entities that legitimately need to be repeated or break structured outputs, so tuning matters.
Examples of Repetition Penalty in Practice
- Summaries: reduce repeated boilerplate lines.
- Stories: avoid looping phrasing.
- Chat: reduce repeated stock disclaimers.
FAQs
How is this different from temperature? Temperature changes overall randomness; repetition penalties specifically target reuse of prior tokens.
When should I avoid it? Avoid it for strict structured outputs where repetition is required; prefer schema constraints in those cases.
Does it reduce hallucinations? Not directly. It only discourages reuse of prior tokens; grounding and verification matter more.
How do I tune it? Start with a mild penalty, test on long outputs, and monitor both the repetition rate and entity correctness as you adjust it.
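The two signals named in the tuning answer can be approximated with simple checks. The sketch below is illustrative only: `repeated_ngram_rate` and `missing_entities` are hypothetical helper names, whitespace tokenization is an assumption, and a case-insensitive substring match is only a rough proxy for entity correctness.

```python
from typing import List, Sequence


def repeated_ngram_rate(tokens: Sequence[str], n: int = 3) -> float:
    """Fraction of n-grams that duplicate an earlier n-gram in the same output."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def missing_entities(text: str, required: Sequence[str]) -> List[str]:
    """Required entity strings that do not appear in the output (case-insensitive)."""
    lowered = text.lower()
    return [entity for entity in required if entity.lower() not in lowered]


# A looping output scores high; an output with no repeats scores zero.
looping = "the cat sat the cat sat the cat sat".split()
varied = "the cat sat on a warm mat today".split()
loop_rate = repeated_ngram_rate(looping, n=3)
varied_rate = repeated_ngram_rate(varied, n=3)

# An over-penalized summary may swap a repeated name for pronouns or drop it.
dropped = missing_entities("She founded the company in Berlin.", ["Ada Lovelace", "Berlin"])
```

Comparing these numbers across a few penalty settings on the same set of long prompts gives a rough picture of the trade-off: the repetition rate should fall as the penalty rises, while a growing list of missing entities suggests the penalty has gone too far.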