16 Chapter 15: Evaluation & Training
17 Introduction
Evaluation is the foundation of AI development–validating model capabilities, guiding training, and ensuring safety. This chapter covers:
Metrics First: How we measure language, vision, and RL–definitions, use cases, pitfalls
Must-Know Datasets: Core benchmarks practitioners encounter in interviews and production
Modern Paradigms: Test-time compute, model-as-judge, contamination
Production Eval: Golden sets, red-teaming, A/B testing, observability
Philosophy: No single metric captures quality. Combine automated metrics (fast, scalable) with human eval (nuanced, expensive). Always validate on held-out data and monitor for distribution drift.
Historical Context–Benchmark Breakthroughs:
2012: AlexNet on ImageNet – 15.3% top-5 error (vs 26% previous SOTA), launched deep learning era
2015: ResNet – 3.6% error on ImageNet (superhuman), residual connections enabled 100+ layer networks
2017: Transformer – Attention mechanism replaced RNNs; BLEU 28.4 on WMT’14 En-De translation
2018: BERT – Pre-training + fine-tuning paradigm; 93.2 F1 on SQuAD 1.1 (near human parity)
2019: GPT-2 – 1.5B params, zero-shot text generation; LAMBADA ppl 8.6 (vs 99.8 prior SOTA)
2020: GPT-3 – 175B params, few-shot learning; 43.9 on MMLU (random chance = 25%)
2022: ChatGPT/GPT-3.5 – RLHF alignment; conversational AI goes mainstream
2023: GPT-4 – Multimodal, 86.4% on MMLU, 67% on HumanEval; contamination concerns emerge
2024-25: Test-time compute scaling – o1 model with chain-of-thought; 83% on AIME (math olympiad)
18 Metrics for Language Models
18.1 Classification Metrics Primer
Before diving into language-specific metrics, recall fundamental classification metrics:
Confusion Matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
Core Metrics:
\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]
\[\text{Precision (PPV)} = \frac{TP}{TP + FP} \quad \text{(Of predicted positives, how many are correct?)}\]
\[\text{Recall (Sensitivity, TPR)} = \frac{TP}{TP + FN} \quad \text{(Of actual positives, how many did we find?)}\]
\[\text{Specificity (TNR)} = \frac{TN}{TN + FP} \quad \text{(Of actual negatives, how many did we correctly reject?)}\]
\[\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}\]
F-beta Score: Generalized F1 with adjustable precision/recall weight: \[F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}\]
\(\beta < 1\): Favor precision (e.g., \(F_{0.5}\) for spam detection–minimize false alarms)
\(\beta = 1\): Balanced (F1)
\(\beta > 1\): Favor recall (e.g., \(F_2\) for medical screening–minimize missed cases)
Key Trade-offs:
Precision vs Recall: Precision-recall curve; high precision → low false positives, high recall → low false negatives
Accuracy misleading with imbalance: 99% negative class → always predict negative gives 99% accuracy but useless
ROC-AUC: Area under ROC curve (TPR vs FPR); measures ranking quality across thresholds
PR-AUC: Area under precision-recall curve; better for imbalanced datasets
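The metrics above can be sketched in a few lines of Python (the confusion-matrix counts below are illustrative, chosen to show how accuracy misleads under class imbalance); `beta` selects the F-beta trade-off:

```python
# Minimal sketch: core classification metrics from confusion-matrix counts.
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f_beta": f_beta}

# 99%-negative imbalance: accuracy looks great, F1 exposes the problem.
m = classification_metrics(tp=5, fp=5, fn=5, tn=985)
print(m["accuracy"])  # 0.99
print(m["f_beta"])    # 0.5
```

Setting `beta=2` in the same call weights recall twice as heavily, matching the medical-screening scenario above.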
18.2 Perplexity
Definition: Exponentiated average negative log-likelihood: \[\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i | x_{<i})\right)\]
Interpretation:
Measures how "surprised" the model is by the next token
Lower = better (model assigns higher probability to ground truth)
PPL of 10 means model is as confused as if choosing uniformly from 10 tokens
Use Cases:
Pre-training validation metric (Wikitext, Penn Treebank)
Comparing language models on held-out data
Detecting out-of-distribution text
Limitations:
Doesn’t measure generation quality (factuality, coherence, helpfulness)
Can be gamed by memorization
Sensitive to tokenization (different tokenizers → incomparable PPL)
Critical: Low perplexity \(\neq\) good generation
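The definition above reduces to a one-liner given per-token log-probabilities (natural log assumed, matching the formula). A minimal sketch, including the "uniform over 10 tokens" intuition:

```python
import math

# Perplexity = exponentiated mean negative log-likelihood of next tokens.
def perplexity(log_probs):
    return math.exp(-sum(log_probs) / len(log_probs))

# A model choosing uniformly among 10 tokens assigns log p = log(1/10)
# to every token, so its perplexity is exactly 10.
uniform = [math.log(1 / 10)] * 5
print(perplexity(uniform))  # ~10.0
```

Note the tokenization caveat above applies here: the same text under two tokenizers yields different `log_probs` lists and incomparable perplexities.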
18.3 BLEU (BiLingual Evaluation Understudy)
Definition: Precision-based n-gram overlap with reference(s): \[\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{4} \frac{1}{4} \log p_n\right)\] where:
\(p_n\): Modified n-gram precision (clipping to avoid reward for repetition)
BP: Brevity penalty \(= \min(1, e^{1 - r/c})\), \(r\) = reference length, \(c\) = candidate length
Modified n-gram precision (clipping): Prevents rewarding repetitive text.
Problem: Without clipping, "the the the the" scores 100% precision if "the" appears in reference.
Solution: Count each n-gram in candidate up to its max occurrences in any reference. \[p_n = \frac{\sum_{\text{n-gram}} \min(\text{Count}_{\text{cand}}, \max_{\text{ref}} \text{Count}_{\text{ref}})}{\sum_{\text{n-gram}} \text{Count}_{\text{cand}}}\]
Example:
Reference: "the cat is on the mat" (2 occurrences of "the")
Candidate: "the the the the the the" (6 occurrences)
Without clipping: precision = 6/6 = 100%
With clipping: count = min(6, 2) = 2 → precision = 2/6 = 33%
Range: 0 to 1 (often scaled to 0-100)
Use Cases:
Machine translation (original use case)
Image captioning, code generation
Limitations:
Only measures precision (not recall)–misses coverage
Lexical matching (synonyms treated as wrong)
Poor correlation with human judgment for long-form text
Requires reference translations
18.4 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Variants:
ROUGE-N: N-gram recall (ROUGE-1, ROUGE-2) \[\text{ROUGE-N} = \frac{\sum_{\text{n-gram} \in \text{ref}} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{\text{n-gram} \in \text{ref}} \text{Count}(\text{n-gram})}\]
ROUGE-L: Longest common subsequence (fluency/order)
Key Difference from BLEU: ROUGE measures recall (how much of reference appears in generation), BLEU measures precision (how much of generation appears in reference).
Use Cases:
Summarization (CNN/DailyMail, XSum)
Dialogue response evaluation
Limitations:
Recall-focused (doesn’t penalize hallucinations)
Lexical overlap only (no semantics)
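The recall orientation is easiest to see in code. A minimal ROUGE-N sketch against a single reference (real implementations handle multiple references and stemming):

```python
from collections import Counter

# ROUGE-N as n-gram recall: what fraction of reference n-grams
# appear in the candidate (with counts clipped)?
def rouge_n(candidate, reference, n=1):
    def ngram_counts(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref_counts = ngram_counts(reference)
    cand_counts = ngram_counts(candidate)
    match = sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    total = sum(ref_counts.values())
    return match / total if total else 0.0

# Candidate matches 3 of 6 reference unigram tokens ("the" clipped to 1).
print(rouge_n("the cat sat on something", "the cat is on the mat"))  # 0.5
```

Note the hallucination caveat above: "sat" and "something" are unsupported by the reference yet cost nothing, because only the reference side is in the denominator.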
18.5 BERTScore
Definition: Semantic similarity via contextual embeddings. Recall matches each reference token to its closest candidate token: \[\text{BERTScore}_{\text{recall}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{y_j \in y} \text{cosine}(\mathbf{h}_{x_i}, \mathbf{h}_{y_j})\] where \(x\) is the reference, \(y\) the candidate, and \(\mathbf{h}\) are BERT (or similar) contextual embeddings. Precision swaps the roles of \(x\) and \(y\); F1 combines the two.
Advantages:
Captures semantic similarity (synonyms scored correctly)
Better human correlation than BLEU/ROUGE
Limitations:
Computationally expensive
Sensitive to embedding model choice
18.6 Human Evaluation
Common Criteria:
Fluency: Grammatical correctness, naturalness
Coherence: Logical flow, topic consistency
Relevance: Addresses the prompt?
Factuality: Accurate claims?
Helpfulness: Satisfies user intent?
Harmlessness: Avoids toxic/offensive content
Annotation Protocols:
Likert scales (1-5 or 1-7)
Pairwise comparisons (A vs B)
Elo ratings (aggregated pairwise wins)
Challenges:
Expensive and slow
Annotator disagreement (inter-rater reliability)
Subjective criteria (cultural biases)
19 Key LLM Benchmarks
19.1 Reasoning & Math
GSM8K (Grade School Math):
8.5K grade-school math word problems
Metric: Exact match on final numerical answer
Multi-step reasoning; benefits from chain-of-thought prompting
Key insight: Test-time compute scaling improves accuracy
SOTA progression: GPT-3 (17%) → GPT-3.5 (57%) → GPT-4 (92%) → o1 (95%+)
MATH:
Competition-level math (AMC, AIME)
5 difficulty levels, 7 subjects
SOTA: GPT-3 (5%) → Minerva (50%) → GPT-4 (52%) → o1 (85%+)
Human olympiad contestants: 90%
HumanEval:
164 Python programming problems with unit tests
Metric: pass@k (probability that at least one of \(k\) samples passes the unit tests)
Tests code synthesis and functional correctness
SOTA pass@1: Codex (29%) → GPT-3.5 (48%) → GPT-4 (67%) → Claude 3.5 (92%)
19.2 Knowledge & Understanding
MMLU (Massive Multitask Language Understanding):
57 subjects (STEM, humanities, social sciences, professional)
15K multiple-choice questions (high school to professional)
Tests breadth of knowledge and zero-shot reasoning
Gold standard for LLM capability comparison
SOTA: Random (25%) → GPT-3 (43%) → GPT-3.5 (70%) → GPT-4 (86%) → Gemini Ultra (90%)
TruthfulQA:
817 questions eliciting common misconceptions
Tests hallucination tendency
Metric: % truthful and informative answers
Challenge: Scaling alone tends to make models less truthful–they imitate common misconceptions (GPT-3: 28%); RLHF alignment helps substantially (GPT-4: 59%)
19.3 Long-Context
RULER (Needle-in-Haystack):
Tests retrieval from contexts up to 128K tokens
Place "needle" at different depths, measure recall
Exposes position bias (models worse at middle)
20 Vision-Language Model (VLM) Metrics & Benchmarks
20.1 Metrics for Vision
20.1.1 FID (Fréchet Inception Distance)
Definition: Distance between feature distributions of real vs generated images: \[\text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})\] where \(\mu, \Sigma\) are mean/covariance of Inception-v3 features.
Use Cases:
GANs, diffusion models
Lower FID = better quality/diversity
Limitations:
Requires 10K+ images for stability
Inception bias (ImageNet features)
Doesn’t capture fine-grained artifacts
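A simplified sketch of the FID formula, assuming diagonal covariances so the matrix square root reduces to an elementwise square root (real FID uses full covariances of Inception-v3 features and a proper matrix square root, e.g. `scipy.linalg.sqrtm`):

```python
import math

# FID under a diagonal-covariance assumption:
#   ||mu_r - mu_g||^2 + sum_i (var_r_i + var_g_i - 2*sqrt(var_r_i*var_g_i))
def fid_diagonal(mu_r, var_r, mu_g, var_g):
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    trace_term = sum(vr + vg - 2 * math.sqrt(vr * vg)
                     for vr, vg in zip(var_r, var_g))
    return mean_term + trace_term

# Identical feature distributions -> FID 0; a shifted mean -> positive.
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))  # 0.0
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]))  # 1.0
```

The stability caveat above lives in the estimation step this sketch skips: with too few images, the sample means and covariances themselves are noisy.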
20.1.2 CLIP Score
Definition: Cosine similarity between CLIP embeddings: \[\text{CLIPScore}(I, c) = \text{cosine}(\text{CLIP}_{\text{img}}(I), \text{CLIP}_{\text{text}}(c))\]
Use Cases:
Text-to-image generation (prompt adherence)
Image captioning (semantic alignment)
Advantage: Reference-free (no ground truth needed)
Limitation: CLIP biases from web data
20.1.3 CIDEr & SPICE
CIDEr: TF-IDF weighted n-gram overlap (emphasizes descriptive terms)
SPICE: Scene graph F1 (parses captions into objects/attributes/relationships)
Better semantic capture than n-gram metrics
Standard for COCO Captions benchmark
20.2 Key VLM Benchmarks
COCO Captions:
330K images, 5 captions each
Metrics: BLEU-4, CIDEr, SPICE
Breakthrough: Show and Tell (2015, CIDEr 94) → CLIP-based models (2021, CIDEr 140+)
VQA v2:
1M+ questions on 200K images
Balanced to reduce language bias
SOTA: BERT-based (2019, 71%) → CLIP+GPT (2021, 78%) → GPT-4V (2023, 77%)
MMMU:
College-level problems across 30+ subjects
Diagrams, charts, scientific figures
SOTA: Random (26%) → GPT-4V (56%) → Gemini 1.5 Pro (62%), Human expert: 89%
21 Reinforcement Learning Evaluation
21.1 RL Metrics
21.1.1 Return (Cumulative Reward)
Definition: \[G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}\] where \(\gamma\) is discount factor (typically 0.99).
Aggregation:
Mean return over evaluation episodes
Median (robust to outliers)
Interquartile mean (IQM): mean of middle 50%
21.1.2 Human-Normalized Score
Definition: \[\text{Score} = \frac{R_{\text{agent}} - R_{\text{random}}}{R_{\text{human}} - R_{\text{random}}}\]
Interpretation:
0 = random policy, 1 = human-level, \(>1\) = superhuman
Allows cross-game comparison (Atari benchmarks)
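The return, IQM aggregation, and human-normalized score above fit in a short sketch (the episode returns and baseline scores are illustrative):

```python
# Discounted return, interquartile mean, and human-normalized score.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):  # backward pass: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

def iqm(values):
    """Mean of the middle 50% of episode returns (robust aggregate)."""
    v = sorted(values)
    q = len(v) // 4
    mid = v[q: len(v) - q]
    return sum(mid) / len(mid)

def human_normalized(agent, random_score, human):
    return (agent - random_score) / (human - random_score)

print(discounted_return([1, 1, 1], gamma=1.0))   # 3.0
print(iqm([0, 10, 11, 12, 13, 14, 15, 1000]))    # 12.5 — outliers trimmed
print(human_normalized(agent=500, random_score=100, human=900))  # 0.5
```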
21.2 Key RL Benchmarks
Atari 2600:
57 games (Pong, Breakout, Space Invaders)
Standard for deep RL (DQN, Rainbow, PPO)
Atari 100K: Sample efficiency (100K environment steps)
Milestones: DQN (2015, human-level) → Rainbow (2018, 230% human) → MuZero (2020, 350%+)
MuJoCo:
Continuous control: Hopper, Walker2d, Ant, Humanoid
Tests continuous control algorithms (SAC, TD3, PPO)
Breakthroughs: TRPO (2015) → PPO (2017) → SAC (2018, SOTA on most tasks)
DMControl Suite:
30+ tasks, pixel-based and state-based
More diverse than MuJoCo
22 Modern Evaluation Paradigms
22.1 Test-Time Compute Scaling
Core Idea: Allocate more inference compute to improve performance (alternative to scaling model size).
Techniques:
Best-of-N: Generate \(N\) samples, select best (by reward model or heuristic)
Chain-of-Thought (CoT): Prompt for reasoning steps before answer
Self-Consistency: Multiple CoT paths with majority vote
Tree-of-Thoughts: Explore multiple reasoning branches, backtrack
Impact: 5-50% accuracy gain on reasoning tasks (GSM8K, MATH)
Trade-offs:
Latency: More compute → slower responses
Cost: Each sample costs API calls or GPU time
Diminishing returns: Gains plateau after \(N \sim 100\)
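Self-consistency is simple to sketch: sample several answers and majority-vote. Here `noisy_solver` is a hypothetical stand-in for sampling an LLM at temperature > 0:

```python
import random
from collections import Counter

# Self-consistency: sample N answers, return the majority vote.
def self_consistency(sample_answer, prompt, n=16):
    answers = [sample_answer(prompt) for _ in range(n)]
    [(winner, count)] = Counter(answers).most_common(1)
    return winner, count / n

random.seed(0)

def noisy_solver(prompt):
    # Toy sampler: right answer 60% of the time, two distractors otherwise.
    return random.choices(["42", "41", "24"], weights=[0.6, 0.2, 0.2])[0]

answer, agreement = self_consistency(noisy_solver, "What is 6 x 7?", n=101)
print(answer)  # "42" — voting recovers the modal answer
```

The trade-offs above are visible here: 101 samples cost 101 generations, and agreement stops improving once the modal answer dominates.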
22.2 Model-as-Judge
Setup: Use LLM (e.g., GPT-4) to score outputs instead of human eval.
Pairwise Comparison:
Present outputs A vs B, ask which is better
Aggregate via Elo ratings or win rate
Challenges:
Position bias: Favors first option
Self-preference: Favors own outputs
Length bias: Favors longer responses
Mitigation:
Swap order (A/B → B/A), aggregate
Use third-party judge
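The order-swap mitigation can be sketched as running the judge in both orders and only counting consistent verdicts; `judge` here is a hypothetical callable returning `"first"` or `"second"` for the presented order:

```python
# Debias a pairwise judge by swapping presentation order (A/B -> B/A)
# and treating inconsistent verdicts as ties (likely position bias).
def swapped_verdict(judge, output_a, output_b):
    v1 = judge(output_a, output_b)   # A shown first
    v2 = judge(output_b, output_a)   # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"

# A judge that always picks whatever is shown first is exposed as a tie.
always_first = lambda x, y: "first"
print(swapped_verdict(always_first, "resp A", "resp B"))  # tie
```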
Benchmarks:
Chatbot Arena: User battles with Elo ratings (100K+ battles)
MT-Bench: 80 multi-turn questions, GPT-4 judge (correlates r > 0.9 with human)
22.3 Contamination & Data Leakage
Problem: Training data contains test examples → inflated performance.
Detection:
N-gram overlap between train and test
Embedding similarity thresholds
Mitigation:
Held-out test sets (never released)
Dynamic benchmarks (generated on-the-fly)
Document decontamination in model cards
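The n-gram overlap check can be sketched as follows (8-gram windows and the example strings are illustrative; production pipelines hash n-grams for scale):

```python
# Flag test examples whose n-gram overlap with the training corpus
# exceeds some threshold — a common decontamination heuristic.
def ngram_set(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(test_example, train_corpus, n=8):
    test_grams = ngram_set(test_example, n)
    if not test_grams:
        return 0.0
    train_grams = set()
    for doc in train_corpus:
        train_grams |= ngram_set(doc, n)
    return len(test_grams & train_grams) / len(test_grams)

train = ["natalia sold clips to 48 of her friends in april and then she "
         "sold half as many clips in may"]
test = "natalia sold clips to 48 of her friends in april how many clips did she sell"
print(contamination_rate(test, train, n=8))  # high overlap -> likely leaked
```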
23 Production Evaluation
23.1 Golden Sets & Test Suites
Composition:
Regression tests: Previous bugs that must not reoccur
Adversarial examples: Known failure modes (prompt injections)
Edge cases: Multilingual, code-switching
Domain-specific: Medical disclaimers, PII redaction
Best Practice:
100-1K curated examples
CI/CD integration: block deploys if accuracy drops
Continuously update with production failures
23.2 Red-Teaming
Goal: Adversarial testing for safety/security vulnerabilities.
Approaches:
Manual: Security experts craft adversarial prompts
Automated: RL agents trained to elicit harmful outputs
Crowdsourced: Platform users paid to break model
Attack Vectors:
Jailbreaks (bypass safety guardrails)
Prompt injections (ignore system instructions)
Toxicity elicitation
PII extraction
Metric: Attack success rate (ASR)
23.3 A/B Testing & Online Evaluation
Setup: Deploy model variant to subset of users, measure real-world impact.
Metrics:
Engagement: Click-through rate, session length, retention
Task success: Completion rate, user satisfaction
Guardrail violations: Toxicity reports, user blocks
Statistical Rigor:
Minimum detectable effect (MDE)
Multiple testing correction (Bonferroni)
Run 1-2 weeks (avoid novelty effect)
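A minimal significance check for this setup is a two-proportion z-test on a success-rate metric (the traffic numbers below are illustrative):

```python
import math

# Two-proportion z-test for an A/B experiment on task-success rate.
def two_proportion_z(successes_a, n_a, successes_b, n_b):
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 54% -> 58% success with 2,000 users per arm.
z, p = two_proportion_z(1080, 2000, 1160, 2000)
print(round(z, 2), round(p, 4))  # significant at the 5% level
```

The MDE caveat above corresponds to sizing `n_a`/`n_b` before launch so that an effect this small is detectable; with a few hundred users per arm the same lift would not reach significance.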
23.4 Observability & Monitoring
Real-Time Metrics:
Latency: P50, P95, P99
Throughput: Requests/sec, tokens/sec
Error rate: 5xx, timeouts, OOM
Quality Metrics:
Output length distribution
Refusal rate
Toxicity classifier scores
24 Interview Cheat Sheet
24.1 Key Datasets to Know
LLM Text:
Reasoning: GSM8K (grade-school math), MATH (competition math), HumanEval (code)
Knowledge: MMLU (57 subjects, multiple choice)
QA: SQuAD, Natural Questions, TriviaQA
Safety: TruthfulQA (hallucinations), ToxiGen (toxicity), BBQ (bias)
Long-context: RULER (needle-in-haystack), LongBench
VLM:
Captioning: COCO Captions, Nocaps
VQA: VQA v2, GQA (reasoning), TextVQA (OCR)
Multimodal: MMMU (college-level), MMBench
RL:
Games: Atari 2600, Procgen
Control: MuJoCo (Hopper, Walker, Ant), DMControl
Multi-task: Meta-World (50 robotic tasks)
| Metric | Use Case | Key Property |
|---|---|---|
| Perplexity | Language modeling | Exponentiated NLL |
| BLEU | Translation, captioning | Precision-based n-gram overlap |
| ROUGE | Summarization | Recall-based n-gram overlap |
| BERTScore | Semantic similarity | Contextual embedding matching |
| CIDEr | Image captioning | TF-IDF weighted n-grams |
| SPICE | Caption semantic content | Scene graph F1 |
| FID | Image generation quality | Fréchet distance of Inception features |
| CLIP Score | Text-image alignment | CLIP embedding similarity |
| Return | RL episode performance | Cumulative discounted reward |
| Human-norm | RL cross-game comparison | Scaled by random/human baseline |
24.2 Common Interview Questions
“What’s the difference between BLEU and ROUGE?”
BLEU measures precision (how much of generation appears in reference), ROUGE measures recall (how much of reference appears in generation). BLEU penalizes brevity; ROUGE is more forgiving. BLEU for translation, ROUGE for summarization.
“Why is perplexity not enough to evaluate LLMs?”
Perplexity measures surprisal on next-token prediction, not generation quality. A model can have low perplexity (good at predicting) but generate factually incorrect or unhelpful text. Doesn’t capture coherence, factuality, or alignment.
“How do you evaluate a text-to-image model?”
Quantitative: FID (distribution match), CLIP Score (prompt adherence), Inception Score (quality+diversity). Qualitative: Human ratings on photorealism, prompt fidelity, diversity. A/B testing in production.
“What is test-time compute scaling?”
Allocating more inference compute (e.g., best-of-N sampling, chain-of-thought, self-consistency) to improve performance. Alternative to scaling model size. Effective for reasoning tasks (GSM8K, MATH) but increases latency/cost.
“How do you detect benchmark contamination?”
N-gram overlap between train and test, embedding similarity, substring matching. Mitigation: use held-out test sets, dynamic benchmarks (generated on-demand), report decontamination procedures in model cards.
“What is model-as-judge, and what are its limitations?”
Use LLM (e.g., GPT-4) to score/rank outputs instead of human eval. Faster and cheaper. Limitations: position bias (favors first option), self-preference (favors own outputs), length bias (favors longer), may miss nuanced errors.
“Describe red-teaming for LLMs.”
Adversarial testing to find safety/security vulnerabilities. Manual (experts craft jailbreaks), automated (RL agents trained to elicit harmful outputs), crowdsourced. Metrics: attack success rate, time-to-break. Critical for pre-deployment safety validation.
“How do you evaluate long-context models?”
Needle-in-haystack (RULER): place fact at different depths in 128K context, measure recall. Position bias analysis (models worse at middle). Long-form QA/summarization benchmarks (LongBench). Check for context length extrapolation failures.
24.3 Production Best Practices
Golden Sets:
100-1K curated examples covering critical failure modes
Regression tests + adversarial + edge cases + domain-specific
CI/CD integration: block deploys if golden set accuracy drops
Continuously update with production failures
A/B Testing:
5-10% traffic to new model, measure engagement/satisfaction
Statistical significance: MDE, multiple testing correction
Monitor for Simpson’s paradox (subgroup effects)
Run for 1-2 weeks (avoid novelty effect bias)
Simpson’s Paradox in A/B Testing
A trend can hold in every subgroup but reverse in aggregate (or vice versa). Example:
Old Model:
Medical queries (100 users): 90 satisfied → 90% satisfaction
General queries (900 users): 450 satisfied → 50% satisfaction
Overall: 540/1000 = 54% satisfaction
New Model:
Medical queries (900 users): 720 satisfied → 80% satisfaction (worse than 90%)
General queries (100 users): 40 satisfied → 40% satisfaction (worse than 50%)
Overall: 760/1000 = 76% satisfaction (better!)
How? Subgroup sizes flipped–new model received more traffic from high-performing category (medical), even though it’s worse at both tasks. Aggregate metric misleads.
Lesson: Always segment by user type, query category, language. Overall metrics can hide regressions in critical subpopulations.
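The numbers above check out in code, which also shows why per-segment assertions belong in an A/B analysis:

```python
# Simpson's paradox: worse in every subgroup, better in aggregate.
old = {"medical": (90, 100), "general": (450, 900)}   # (satisfied, users)
new = {"medical": (720, 900), "general": (40, 100)}

def rate(sat, n):
    return sat / n

def overall(groups):
    sat = sum(s for s, _ in groups.values())
    n = sum(n for _, n in groups.values())
    return sat / n

for g in old:
    assert rate(*new[g]) < rate(*old[g])   # new model loses each subgroup
print(overall(old), overall(new))          # 0.54 0.76 — yet wins overall
```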
Observability:
Real-time: P95/P99 latency, error rate, throughput
Quality: Output length, refusal rate, toxicity scores
Drift: Input/output distribution shifts, golden set accuracy
Alerting: Latency spikes, error rate \(>\) threshold, golden set failures
25 Summary
Evaluation Philosophy:
No single metric captures all aspects of model quality
Combine automated metrics (fast, scalable) with human eval (nuanced, expensive)
Test-time compute scaling increasingly important (CoT, self-consistency, best-of-N)
Contamination/leakage undermines benchmark validity–use dynamic evals
Key Trade-offs:
Correlation vs cost: Human eval gold standard but expensive; model-as-judge cheaper but biased
Static vs dynamic: Static benchmarks saturate; dynamic benchmarks harder to compare across time
In-distribution vs OOD: High performance on benchmarks \(\neq\) robust in production
Emerging Trends:
Test-time compute scaling laws (inference-time reasoning)
Agentic evaluation (models solve tasks via tool use, not just text generation)
Multimodal benchmarks (MMMU, video understanding)
Safety-first evaluation (red-teaming, jailbreak robustness, alignment)