Evals, short for LLM evaluations, are systematic tests that measure how well a language model or an LLM application performs on defined tasks, using automated metrics, graded rubrics, or human review to track quality, safety, and reliability.
What are Evals (LLM Evaluations)?
LLM evaluation is the process of turning a vague goal like "better answers" into measurable criteria. An eval set typically includes representative prompts, expected outputs, and a scoring method. For some tasks, exact match or structured checks are possible, such as verifying JSON validity or whether a tool call used the correct arguments. For open-ended generation, evaluators may use semantic similarity, classifier-based checks, or an LLM as a judge with a clear rubric.
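For instance, structured checks like these can be a few lines of Python. The following is a minimal sketch; the helper names and example cases are illustrative, not taken from any particular eval framework:

```python
import json

def check_json_valid(output: str) -> bool:
    """Structured check: does the model output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def check_tool_call(call: dict, expected: dict) -> bool:
    """Structured check: did the call use the right tool name and arguments?"""
    return (call.get("name") == expected["name"]
            and call.get("arguments") == expected["arguments"])

# Illustrative cases, not from a real eval set.
print(check_json_valid('{"city": "Paris"}'))  # True
print(check_tool_call(
    {"name": "get_weather", "arguments": {"city": "Paris"}},
    {"name": "get_weather", "arguments": {"city": "Paris"}},
))  # True
```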
Evals can be offline or online. Offline evals run on a fixed dataset to compare model versions, prompts, and retrieval pipelines before deployment. Online evals sample production traffic and combine user feedback, human review, and automated checks. Good evals are versioned, reproducible, and cover success and failure cases, including adversarial prompts, policy violations, and edge cases for long context or retrieval.
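A bare-bones offline eval loop over a fixed dataset might look like the sketch below; `run_model` is a hypothetical stand-in for whatever client calls your model, and the two-case dataset and exact-match scorer are purely illustrative:

```python
from typing import Callable

# Hypothetical fixed eval set: each case pairs a prompt with an expected answer.
EVAL_SET = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 on exact match after trimming whitespace, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_offline_eval(run_model: Callable[[str], str]) -> float:
    """Run every case in the fixed set and return the mean score."""
    scores = [exact_match(run_model(c["prompt"]), c["expected"]) for c in EVAL_SET]
    return sum(scores) / len(scores)

# Compare two candidate systems (trivially mocked here) before deployment.
baseline = run_offline_eval(lambda p: "4" if "2 + 2" in p else "Paris")
candidate = run_offline_eval(lambda p: "4")
print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
```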
Where evals are used and why they matter
Evals are used in prompt engineering, model selection, RAG tuning, and agent tool-use testing. They matter because LLM systems can regress when you change a prompt, upgrade a model, adjust chunking, or modify tool schemas. Evals provide a quality gate for releases, help detect drift, and make tradeoffs explicit, such as accepting slightly lower creativity to gain factuality.
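One common way to encode that quality gate is a threshold check in CI. The sketch below assumes scores produced by an offline eval run like the one above; the 0.90 bar and the noise tolerance are illustrative choices, not recommended values:

```python
# Hypothetical CI quality gate: block the release if the candidate's eval
# score falls below a fixed bar or regresses against the baseline.
THRESHOLD = 0.90
NOISE_TOLERANCE = 0.02  # tolerate small run-to-run variation

def quality_gate(candidate_score: float, baseline_score: float) -> None:
    if candidate_score < THRESHOLD:
        raise SystemExit(f"FAIL: {candidate_score:.2f} is below the {THRESHOLD:.2f} bar")
    if candidate_score < baseline_score - NOISE_TOLERANCE:
        raise SystemExit(f"FAIL: regression vs baseline {baseline_score:.2f}")
    print("PASS: candidate cleared the quality gate")

quality_gate(candidate_score=0.93, baseline_score=0.94)  # illustrative scores
```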
Types
- Task accuracy evals, such as QA correctness, classification F1, or tool call success rate (see the sketch after this list).
- Safety evals, such as jailbreak resistance, toxicity, and privacy leakage checks.
- RAG evals, such as context relevance, citation support, and answer groundedness.
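To make the first category concrete, here is a small self-contained sketch that computes a binary-classification F1 and a tool-call success rate; the predictions, labels, and call results are hypothetical:

```python
# Hypothetical eval results for two task-accuracy metrics.
# Binary classification: 1 = positive label, 0 = negative.
preds  = [1, 0, 1, 1, 0, 1]
labels = [1, 0, 0, 1, 0, 0]

tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"F1 = {f1:.2f}")

# Tool-call success rate: fraction of calls judged correct.
tool_call_results = [True, True, False, True]
print(f"tool call success rate = {sum(tool_call_results) / len(tool_call_results):.2f}")
```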
FAQs
1. What makes an eval reliable?
A reliable eval has representative data, clear rubrics, stable scoring, and is sensitive enough to detect meaningful regressions.
2. Can I use an LLM as a judge?
Yes, but you should define a rubric, pin the judge model and prompt, and validate with human review to avoid bias or inconsistency; a sketch of a rubric-driven judge follows these FAQs.
3. How large should an eval set be?
Start with 50 to 200 high quality examples, then expand with real traffic and failure cases as your product evolves.
4. How do evals relate to A/B testing?
Evals are controlled tests, while A/B tests measure impact on real users. Many teams use evals to filter candidates before A/B testing.
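As referenced in FAQ 2, a rubric-driven LLM-as-a-judge might be sketched as follows; the rubric, the 1-to-5 scale, and the `call_judge_model` stand-in are all assumptions for illustration, not a specific product's API:

```python
# Hypothetical LLM-as-a-judge setup: a fixed rubric, a pinned judge model,
# and a versioned prompt template.
JUDGE_PROMPT = """You are grading an answer against a rubric. Score 1-5.
Rubric:
- 5: fully correct, grounded in the provided context, no unsupported claims
- 3: partially correct or partially grounded
- 1: incorrect or ungrounded
Question: {question}
Answer: {answer}
Respond with only the integer score."""

def call_judge_model(prompt: str) -> str:
    # Stand-in: replace with a real API call to a pinned judge model version.
    return "5"

def judge(question: str, answer: str) -> int:
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    score = int(raw.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return score

print(judge("Capital of France?", "Paris"))  # 5 (mocked)
```

Keeping the prompt template and judge model version fixed, as above, is what makes judge scores comparable across eval runs; validating a sample of judge scores against human review guards against systematic bias.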