16 Chapter 15: Evaluation & Training
17 Introduction
Evaluation is the foundation of AI development–validating model capabilities, guiding training, and ensuring safety. This chapter covers:
Metrics First: How we measure language, vision, and RL–definitions, use cases, pitfalls
Must-Know Datasets: Core benchmarks practitioners encounter in interviews and production
Modern Paradigms: Test-time compute, model-as-judge, contamination
Production Eval: Golden sets, red-teaming, A/B testing, observability
Philosophy: No single metric captures quality. Combine automated metrics (fast, scalable) with human eval (nuanced, expensive). Always validate on held-out data and monitor for distribution drift.
Historical Context–Benchmark Breakthroughs:
2012: AlexNet on ImageNet – 15.3% top-5 error (vs 26% previous SOTA), launched deep learning era
2015: ResNet – 3.6% error on ImageNet (superhuman), residual connections enabled 100+ layer networks
2017: Transformer – Attention mechanism replaced RNNs; BLEU 28.4 on WMT’14 En-De translation
2018: BERT – Pre-training + fine-tuning paradigm; 93.2 F1 on SQuAD 1.1 (near human parity)
2019: GPT-2 – 1.5B params, zero-shot text generation; LAMBADA ppl 8.6 (vs 99.8 prior SOTA)
2020: GPT-3 – 175B params, few-shot learning; 43.9 on MMLU (random chance = 25%)
2022: ChatGPT/GPT-3.5 – RLHF alignment; conversational AI goes mainstream
2023: GPT-4 – Multimodal, 86.4% on MMLU, 67% on HumanEval; contamination concerns emerge
2024-25: Test-time compute scaling – o1 model with chain-of-thought; 83% on AIME (math olympiad)
18 Metrics for Language Models
18.1 Classification Metrics Primer
Before diving into language-specific metrics, recall fundamental classification metrics:
Confusion Matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
Core Metrics:
\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]
\[\text{Precision (PPV)} = \frac{TP}{TP + FP} \quad \text{(Of predicted positives, how many are correct?)}\]
\[\text{Recall (Sensitivity, TPR)} = \frac{TP}{TP + FN} \quad \text{(Of actual positives, how many did we find?)}\]
\[\text{Specificity (TNR)} = \frac{TN}{TN + FP} \quad \text{(Of actual negatives, how many did we correctly reject?)}\]
\[\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}\]
F-beta Score: Generalized F1 with adjustable precision/recall weight: \[F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}\]
\(\beta < 1\): Favor precision (e.g., \(F_{0.5}\) for spam detection–minimize false alarms)
\(\beta = 1\): Balanced (F1)
\(\beta > 1\): Favor recall (e.g., \(F_2\) for medical screening–minimize missed cases)
Key Trade-offs:
Precision vs Recall: Precision-recall curve; high precision → low false positives, high recall → low false negatives
Accuracy misleading with imbalance: 99% negative class → always predict negative gives 99% accuracy but useless
ROC-AUC: Area under ROC curve (TPR vs FPR); measures ranking quality across thresholds
PR-AUC: Area under precision-recall curve; better for imbalanced datasets
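The metrics above can be sketched in a few lines of Python (the confusion-matrix counts below are illustrative, chosen to show how accuracy misleads under class imbalance); `beta` selects the F-beta trade-off:

```python
# Minimal sketch: core classification metrics from confusion-matrix counts.
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f_beta": f_beta}

# 99%-negative imbalance: accuracy looks great, F1 exposes the problem.
m = classification_metrics(tp=5, fp=5, fn=5, tn=985)
print(m["accuracy"])  # 0.99
print(m["f_beta"])    # 0.5
```

Setting `beta=2` in the same call weights recall twice as heavily, matching the medical-screening scenario above.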
18.2 Perplexity
Definition: Exponentiated average negative log-likelihood: \[\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i | x_{<i})\right)\]
Interpretation:
Measures how "surprised" the model is by the next token
Lower = better (model assigns higher probability to ground truth)
PPL of 10 means model is as confused as if choosing uniformly from 10 tokens
Use Cases:
Pre-training validation metric (Wikitext, Penn Treebank)
Comparing language models on held-out data
Detecting out-of-distribution text
Limitations:
Doesn’t measure generation quality (factuality, coherence, helpfulness)
Can be gamed by memorization
Sensitive to tokenization (different tokenizers → incomparable PPL)
Critical: Low perplexity \(\neq\) good generation
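The definition above reduces to a one-liner given per-token log-probabilities (natural log assumed, matching the formula). A minimal sketch, including the "uniform over 10 tokens" intuition:

```python
import math

# Perplexity = exponentiated mean negative log-likelihood of next tokens.
def perplexity(log_probs):
    return math.exp(-sum(log_probs) / len(log_probs))

# A model choosing uniformly among 10 tokens assigns log p = log(1/10)
# to every token, so its perplexity is exactly 10.
uniform = [math.log(1 / 10)] * 5
print(perplexity(uniform))  # ~10.0
```

Note the tokenization caveat above applies here: the same text under two tokenizers yields different `log_probs` lists and incomparable perplexities.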
18.3 BLEU (BiLingual Evaluation Understudy)
Definition: Precision-based n-gram overlap with reference(s): \[\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{4} \frac{1}{4} \log p_n\right)\] where:
\(p_n\): Modified n-gram precision (clipping to avoid reward for repetition)
BP: Brevity penalty \(= \min(1, e^{1 - r/c})\), \(r\) = reference length, \(c\) = candidate length
Modified n-gram precision (clipping): Prevents rewarding repetitive text.
Problem: Without clipping, "the the the the" scores 100% precision if "the" appears in reference.
Solution: Count each n-gram in candidate up to its max occurrences in any reference. \[p_n = \frac{\sum_{\text{n-gram}} \min(\text{Count}_{\text{cand}}, \max_{\text{ref}} \text{Count}_{\text{ref}})}{\sum_{\text{n-gram}} \text{Count}_{\text{cand}}}\]
Example:
Reference: "the cat is on the mat" (2 occurrences of "the")
Candidate: "the the the the the the" (6 occurrences)
Without clipping: precision = 6/6 = 100%
With clipping: count = min(6, 2) = 2 → precision = 2/6 = 33%
Range: 0 to 1 (often scaled to 0-100)
Use Cases:
Machine translation (original use case)
Image captioning, code generation
Limitations:
Only measures precision (not recall)–misses coverage
Lexical matching (synonyms treated as wrong)
Poor correlation with human judgment for long-form text
Requires reference translations
18.4 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Variants:
ROUGE-N: N-gram recall (ROUGE-1, ROUGE-2) \[\text{ROUGE-N} = \frac{\sum_{\text{n-gram} \in \text{ref}} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{\text{n-gram} \in \text{ref}} \text{Count}(\text{n-gram})}\]
ROUGE-L: Longest common subsequence (fluency/order)
Key Difference from BLEU: ROUGE measures recall (how much of reference appears in generation), BLEU measures precision (how much of generation appears in reference).
Use Cases:
Summarization (CNN/DailyMail, XSum)
Dialogue response evaluation
Limitations:
Recall-focused (doesn’t penalize hallucinations)
Lexical overlap only (no semantics)
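The recall orientation is easiest to see in code. A minimal ROUGE-N sketch against a single reference (real implementations handle multiple references and stemming):

```python
from collections import Counter

# ROUGE-N as n-gram recall: what fraction of reference n-grams
# appear in the candidate (with counts clipped)?
def rouge_n(candidate, reference, n=1):
    def ngram_counts(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref_counts = ngram_counts(reference)
    cand_counts = ngram_counts(candidate)
    match = sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    total = sum(ref_counts.values())
    return match / total if total else 0.0

# Candidate matches 3 of 6 reference unigram tokens ("the" clipped to 1).
print(rouge_n("the cat sat on something", "the cat is on the mat"))  # 0.5
```

Note the hallucination caveat above: "sat" and "something" are unsupported by the reference yet cost nothing, because only the reference side is in the denominator.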
18.5 BERTScore
Definition: Semantic similarity via contextual embeddings. Recall matches each reference token to its closest candidate token: \[\text{BERTScore}_{\text{recall}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{y_j \in y} \text{cosine}(\mathbf{h}_{x_i}, \mathbf{h}_{y_j})\] where \(x\) is the reference, \(y\) the candidate, and \(\mathbf{h}\) are BERT (or similar) contextual embeddings. Precision swaps the roles of \(x\) and \(y\); F1 combines the two.
Advantages:
Captures semantic similarity (synonyms scored correctly)
Better human correlation than BLEU/ROUGE
Limitations:
Computationally expensive
Sensitive to embedding model choice
18.6 Human Evaluation
Common Criteria:
Fluency: Grammatical correctness, naturalness
Coherence: Logical flow, topic consistency
Relevance: Addresses the prompt?
Factuality: Accurate claims?
Helpfulness: Satisfies user intent?
Harmlessness: Avoids toxic/offensive content
Annotation Protocols:
Likert scales (1-5 or 1-7)
Pairwise comparisons (A vs B)
Elo ratings (aggregated pairwise wins)
Challenges:
Expensive and slow
Annotator disagreement (inter-rater reliability)
Subjective criteria (cultural biases)
19 Key LLM Benchmarks
19.1 Reasoning & Math
GSM8K (Grade School Math):
8.5K grade-school math word problems
Metric: Exact match on final numerical answer
Multi-step reasoning; benefits from chain-of-thought prompting
Key insight: Test-time compute scaling improves accuracy
SOTA progression: GPT-3 (17%) → GPT-3.5 (57%) → GPT-4 (92%) → o1 (95%+)
MATH:
Competition-level math (AMC, AIME)
5 difficulty levels, 7 subjects
SOTA: GPT-3 (5%) → Minerva (50%) → GPT-4 (52%) → o1 (85%+)
Human olympiad contestants: 90%
HumanEval:
164 Python programming problems with unit tests
Metric: pass@k (probability that at least one of \(k\) samples passes the unit tests)
Tests code synthesis and functional correctness
SOTA pass@1: Codex (29%) → GPT-3.5 (48%) → GPT-4 (67%) → Claude 3.5 (92%)
19.2 Knowledge & Understanding
MMLU (Massive Multitask Language Understanding):
57 subjects (STEM, humanities, social sciences, professional)
15K multiple-choice questions (high school to professional)
Tests breadth of knowledge and zero-shot reasoning
Gold standard for LLM capability comparison
SOTA: Random (25%) → GPT-3 (43%) → GPT-3.5 (70%) → GPT-4 (86%) → Gemini Ultra (90%)
TruthfulQA:
817 questions eliciting common misconceptions
Tests hallucination tendency
Metric: % truthful and informative answers
Challenge: Scaling alone tends to make models less truthful–they imitate common misconceptions (GPT-3: 28%); RLHF alignment helps substantially (GPT-4: 59%)
19.3 Long-Context
RULER (Needle-in-Haystack):
Tests retrieval from contexts up to 128K tokens
Place "needle" at different depths, measure recall
Exposes position bias (models worse at middle)
20 Vision-Language Model (VLM) Metrics & Benchmarks
20.1 Metrics for Vision
20.1.1 FID (Fréchet Inception Distance)
Definition: Distance between feature distributions of real vs generated images: \[\text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})\] where \(\mu, \Sigma\) are mean/covariance of Inception-v3 features.
Use Cases:
GANs, diffusion models
Lower FID = better quality/diversity
Limitations:
Requires 10K+ images for stability
Inception bias (ImageNet features)
Doesn’t capture fine-grained artifacts
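A simplified sketch of the FID formula, assuming diagonal covariances so the matrix square root reduces to an elementwise square root (real FID uses full covariances of Inception-v3 features and a proper matrix square root, e.g. `scipy.linalg.sqrtm`):

```python
import math

# FID under a diagonal-covariance assumption:
#   ||mu_r - mu_g||^2 + sum_i (var_r_i + var_g_i - 2*sqrt(var_r_i*var_g_i))
def fid_diagonal(mu_r, var_r, mu_g, var_g):
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    trace_term = sum(vr + vg - 2 * math.sqrt(vr * vg)
                     for vr, vg in zip(var_r, var_g))
    return mean_term + trace_term

# Identical feature distributions -> FID 0; a shifted mean -> positive.
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))  # 0.0
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]))  # 1.0
```

The stability caveat above lives in the estimation step this sketch skips: with too few images, the sample means and covariances themselves are noisy.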
20.1.2 CLIP Score
Definition: Cosine similarity between CLIP embeddings: \[\text{CLIPScore}(I, c) = \text{cosine}(\text{CLIP}_{\text{img}}(I), \text{CLIP}_{\text{text}}(c))\]
Use Cases:
Text-to-image generation (prompt adherence)
Image captioning (semantic alignment)
Advantage: Reference-free (no ground truth needed)
Limitation: CLIP biases from web data
20.1.3 CIDEr & SPICE
CIDEr: TF-IDF weighted n-gram overlap (emphasizes descriptive terms)
SPICE: Scene graph F1 (parses captions into objects/attributes/relationships)
Better semantic capture than n-gram metrics
Standard for COCO Captions benchmark
20.2 Key VLM Benchmarks
COCO Captions:
330K images, 5 captions each
Metrics: BLEU-4, CIDEr, SPICE
Breakthrough: Show and Tell (2015, CIDEr 94) → CLIP-based models (2021, CIDEr 140+)
VQA v2:
1M+ questions on 200K images
Balanced to reduce language bias
SOTA: BERT-based (2019, 71%) → CLIP+GPT (2021, 78%) → GPT-4V (2023, 77%)
MMMU:
College-level problems across 30+ subjects
Diagrams, charts, scientific figures
SOTA: Random (26%) → GPT-4V (56%) → Gemini 1.5 Pro (62%), Human expert: 89%
21 Reinforcement Learning Evaluation
21.1 RL Metrics
21.1.1 Return (Cumulative Reward)
Definition: \[G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}\] where \(\gamma\) is discount factor (typically 0.99).
Aggregation:
Mean return over evaluation episodes
Median (robust to outliers)
Interquartile mean (IQM): mean of middle 50%
21.1.2 Human-Normalized Score
Definition: \[\text{Score} = \frac{R_{\text{agent}} - R_{\text{random}}}{R_{\text{human}} - R_{\text{random}}}\]
Interpretation:
0 = random policy, 1 = human-level, \(>1\) = superhuman
Allows cross-game comparison (Atari benchmarks)
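The return, IQM aggregation, and human-normalized score above fit in a short sketch (the episode returns and baseline scores are illustrative):

```python
# Discounted return, interquartile mean, and human-normalized score.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):  # backward pass: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

def iqm(values):
    """Mean of the middle 50% of episode returns (robust aggregate)."""
    v = sorted(values)
    q = len(v) // 4
    mid = v[q: len(v) - q]
    return sum(mid) / len(mid)

def human_normalized(agent, random_score, human):
    return (agent - random_score) / (human - random_score)

print(discounted_return([1, 1, 1], gamma=1.0))   # 3.0
print(iqm([0, 10, 11, 12, 13, 14, 15, 1000]))    # 12.5 — outliers trimmed
print(human_normalized(agent=500, random_score=100, human=900))  # 0.5
```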
21.2 Key RL Benchmarks
Atari 2600:
57 games (Pong, Breakout, Space Invaders)
Standard for deep RL (DQN, Rainbow, PPO)
Atari 100K: Sample efficiency (100K environment steps)
Milestones: DQN (2015, human-level) → Rainbow (2018, 230% human) → MuZero (2020, 350%+)
MuJoCo:
Continuous control: Hopper, Walker2d, Ant, Humanoid
Tests continuous control algorithms (SAC, TD3, PPO)
Breakthroughs: TRPO (2015) → PPO (2017) → SAC (2018, SOTA on most tasks)
DMControl Suite:
30+ tasks, pixel-based and state-based
More diverse than MuJoCo
22 Modern Evaluation Paradigms
22.1 Test-Time Compute Scaling
Core Idea: Allocate more inference compute to improve performance (alternative to scaling model size).
Techniques:
Best-of-N: Generate \(N\) samples, select best (by reward model or heuristic)
Chain-of-Thought (CoT): Prompt for reasoning steps before answer
Self-Consistency: Multiple CoT paths with majority vote
Tree-of-Thoughts: Explore multiple reasoning branches, backtrack
Impact: 5-50% accuracy gain on reasoning tasks (GSM8K, MATH)
Trade-offs:
Latency: More compute → slower responses
Cost: Each sample costs API calls or GPU time
Diminishing returns: Gains plateau after \(N \sim 100\)
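Self-consistency is simple to sketch: sample several answers and majority-vote. Here `noisy_solver` is a hypothetical stand-in for sampling an LLM at temperature > 0:

```python
import random
from collections import Counter

# Self-consistency: sample N answers, return the majority vote.
def self_consistency(sample_answer, prompt, n=16):
    answers = [sample_answer(prompt) for _ in range(n)]
    [(winner, count)] = Counter(answers).most_common(1)
    return winner, count / n

random.seed(0)

def noisy_solver(prompt):
    # Toy sampler: right answer 60% of the time, two distractors otherwise.
    return random.choices(["42", "41", "24"], weights=[0.6, 0.2, 0.2])[0]

answer, agreement = self_consistency(noisy_solver, "What is 6 x 7?", n=101)
print(answer)  # "42" — voting recovers the modal answer
```

The trade-offs above are visible here: 101 samples cost 101 generations, and agreement stops improving once the modal answer dominates.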
22.2 Model-as-Judge
Setup: Use LLM (e.g., GPT-4) to score outputs instead of human eval.
Pairwise Comparison:
Present outputs A vs B, ask which is better
Aggregate via Elo ratings or win rate
Challenges:
Position bias: Favors first option
Self-preference: Favors own outputs
Length bias: Favors longer responses
Mitigation:
Swap order (A/B → B/A), aggregate
Use third-party judge
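The order-swap mitigation can be sketched as running the judge in both orders and only counting consistent verdicts; `judge` here is a hypothetical callable returning `"first"` or `"second"` for the presented order:

```python
# Debias a pairwise judge by swapping presentation order (A/B -> B/A)
# and treating inconsistent verdicts as ties (likely position bias).
def swapped_verdict(judge, output_a, output_b):
    v1 = judge(output_a, output_b)   # A shown first
    v2 = judge(output_b, output_a)   # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"

# A judge that always picks whatever is shown first is exposed as a tie.
always_first = lambda x, y: "first"
print(swapped_verdict(always_first, "resp A", "resp B"))  # tie
```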
Benchmarks:
Chatbot Arena: User battles with Elo ratings (100K+ battles)
MT-Bench: 80 multi-turn questions, GPT-4 judge (correlates r > 0.9 with human)
22.3 Contamination & Data Leakage
Problem: Training data contains test examples → inflated performance.
Detection:
N-gram overlap between train and test
Embedding similarity thresholds
Mitigation:
Held-out test sets (never released)
Dynamic benchmarks (generated on-the-fly)
Document decontamination in model cards
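The n-gram overlap check can be sketched as follows (8-gram windows and the example strings are illustrative; production pipelines hash n-grams for scale):

```python
# Flag test examples whose n-gram overlap with the training corpus
# exceeds some threshold — a common decontamination heuristic.
def ngram_set(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(test_example, train_corpus, n=8):
    test_grams = ngram_set(test_example, n)
    if not test_grams:
        return 0.0
    train_grams = set()
    for doc in train_corpus:
        train_grams |= ngram_set(doc, n)
    return len(test_grams & train_grams) / len(test_grams)

train = ["natalia sold clips to 48 of her friends in april and then she "
         "sold half as many clips in may"]
test = "natalia sold clips to 48 of her friends in april how many clips did she sell"
print(contamination_rate(test, train, n=8))  # high overlap -> likely leaked
```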
23 Production Evaluation
23.1 Golden Sets & Test Suites
Composition:
Regression tests: Previous bugs that must not reoccur
Adversarial examples: Known failure modes (prompt injections)
Edge cases: Multilingual, code-switching
Domain-specific: Medical disclaimers, PII redaction
Best Practice:
100-1K curated examples
CI/CD integration: block deploys if accuracy drops
Continuously update with production failures
23.2 Red-Teaming
Goal: Adversarial testing for safety/security vulnerabilities.
Approaches:
Manual: Security experts craft adversarial prompts
Automated: RL agents trained to elicit harmful outputs
Crowdsourced: Platform users paid to break model
Attack Vectors:
Jailbreaks (bypass safety guardrails)
Prompt injections (ignore system instructions)
Toxicity elicitation
PII extraction
Metric: Attack success rate (ASR)
23.3 A/B Testing & Online Evaluation
Setup: Deploy model variant to subset of users, measure real-world impact.
Metrics:
Engagement: Click-through rate, session length, retention
Task success: Completion rate, user satisfaction
Guardrail violations: Toxicity reports, user blocks
Statistical Rigor:
Minimum detectable effect (MDE)
Multiple testing correction (Bonferroni)
Run 1-2 weeks (avoid novelty effect)
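A minimal significance check for this setup is a two-proportion z-test on a success-rate metric (the traffic numbers below are illustrative):

```python
import math

# Two-proportion z-test for an A/B experiment on task-success rate.
def two_proportion_z(successes_a, n_a, successes_b, n_b):
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value via the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 54% -> 58% success with 2,000 users per arm.
z, p = two_proportion_z(1080, 2000, 1160, 2000)
print(round(z, 2), round(p, 4))  # significant at the 5% level
```

The MDE caveat above corresponds to sizing `n_a`/`n_b` before launch so that an effect this small is detectable; with a few hundred users per arm the same lift would not reach significance.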
23.4 Observability & Monitoring
Real-Time Metrics:
Latency: P50, P95, P99
Throughput: Requests/sec, tokens/sec
Error rate: 5xx, timeouts, OOM
Quality Metrics:
Output length distribution
Refusal rate
Toxicity classifier scores
24 Interview Cheat Sheet
24.1 Key Datasets to Know
LLM Text:
Reasoning: GSM8K (grade-school math), MATH (competition math), HumanEval (code)
Knowledge: MMLU (57 subjects, multiple choice)
QA: SQuAD, Natural Questions, TriviaQA
Safety: TruthfulQA (hallucinations), ToxiGen (toxicity), BBQ (bias)
Long-context: RULER (needle-in-haystack), LongBench
VLM:
Captioning: COCO Captions, Nocaps
VQA: VQA v2, GQA (reasoning), TextVQA (OCR)
Multimodal: MMMU (college-level), MMBench
RL:
Games: Atari 2600, Procgen
Control: MuJoCo (Hopper, Walker, Ant), DMControl
Multi-task: Meta-World (50 robotic tasks)
| Metric | Use Case | Key Property |
|---|---|---|
| Perplexity | Language modeling | Exponentiated NLL |
| BLEU | Translation, captioning | Precision-based n-gram overlap |
| ROUGE | Summarization | Recall-based n-gram overlap |
| BERTScore | Semantic similarity | Contextual embedding matching |
| CIDEr | Image captioning | TF-IDF weighted n-grams |
| SPICE | Caption semantic content | Scene graph F1 |
| FID | Image generation quality | Fréchet distance of Inception features |
| CLIP Score | Text-image alignment | CLIP embedding similarity |
| Return | RL episode performance | Cumulative discounted reward |
| Human-norm | RL cross-game comparison | Scaled by random/human baseline |
24.2 Common Interview Questions
“What’s the difference between BLEU and ROUGE?”
BLEU measures precision (how much of generation appears in reference), ROUGE measures recall (how much of reference appears in generation). BLEU penalizes brevity; ROUGE is more forgiving. BLEU for translation, ROUGE for summarization.
“Why is perplexity not enough to evaluate LLMs?”
Perplexity measures surprisal on next-token prediction, not generation quality. A model can have low perplexity (good at predicting) but generate factually incorrect or unhelpful text. Doesn’t capture coherence, factuality, or alignment.
“How do you evaluate a text-to-image model?”
Quantitative: FID (distribution match), CLIP Score (prompt adherence), Inception Score (quality+diversity). Qualitative: Human ratings on photorealism, prompt fidelity, diversity. A/B testing in production.
“What is test-time compute scaling?”
Allocating more inference compute (e.g., best-of-N sampling, chain-of-thought, self-consistency) to improve performance. Alternative to scaling model size. Effective for reasoning tasks (GSM8K, MATH) but increases latency/cost.
“How do you detect benchmark contamination?”
N-gram overlap between train and test, embedding similarity, substring matching. Mitigation: use held-out test sets, dynamic benchmarks (generated on-demand), report decontamination procedures in model cards.
“What is model-as-judge, and what are its limitations?”
Use LLM (e.g., GPT-4) to score/rank outputs instead of human eval. Faster and cheaper. Limitations: position bias (favors first option), self-preference (favors own outputs), length bias (favors longer), may miss nuanced errors.
“Describe red-teaming for LLMs.”
Adversarial testing to find safety/security vulnerabilities. Manual (experts craft jailbreaks), automated (RL agents trained to elicit harmful outputs), crowdsourced. Metrics: attack success rate, time-to-break. Critical for pre-deployment safety validation.
“How do you evaluate long-context models?”
Needle-in-haystack (RULER): place fact at different depths in 128K context, measure recall. Position bias analysis (models worse at middle). Long-form QA/summarization benchmarks (LongBench). Check for context length extrapolation failures.
24.3 Production Best Practices
Golden Sets:
100-1K curated examples covering critical failure modes
Regression tests + adversarial + edge cases + domain-specific
CI/CD integration: block deploys if golden set accuracy drops
Continuously update with production failures
A/B Testing:
5-10% traffic to new model, measure engagement/satisfaction
Statistical significance: MDE, multiple testing correction
Monitor for Simpson’s paradox (subgroup effects)
Run for 1-2 weeks (avoid novelty effect bias)
Simpson’s Paradox in A/B Testing
A trend can hold in every subgroup but reverse in aggregate (or vice versa). Example:
Old Model:
Medical queries (100 users): 90 satisfied → 90% satisfaction
General queries (900 users): 450 satisfied → 50% satisfaction
Overall: 540/1000 = 54% satisfaction
New Model:
Medical queries (900 users): 720 satisfied → 80% satisfaction (worse than 90%)
General queries (100 users): 40 satisfied → 40% satisfaction (worse than 50%)
Overall: 760/1000 = 76% satisfaction (better!)
How? Subgroup sizes flipped–new model received more traffic from high-performing category (medical), even though it’s worse at both tasks. Aggregate metric misleads.
Lesson: Always segment by user type, query category, language. Overall metrics can hide regressions in critical subpopulations.
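The numbers above check out in code, which also shows why per-segment assertions belong in an A/B analysis:

```python
# Simpson's paradox: worse in every subgroup, better in aggregate.
old = {"medical": (90, 100), "general": (450, 900)}   # (satisfied, users)
new = {"medical": (720, 900), "general": (40, 100)}

def rate(sat, n):
    return sat / n

def overall(groups):
    sat = sum(s for s, _ in groups.values())
    n = sum(n for _, n in groups.values())
    return sat / n

for g in old:
    assert rate(*new[g]) < rate(*old[g])   # new model loses each subgroup
print(overall(old), overall(new))          # 0.54 0.76 — yet wins overall
```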
Observability:
Real-time: P95/P99 latency, error rate, throughput
Quality: Output length, refusal rate, toxicity scores
Drift: Input/output distribution shifts, golden set accuracy
Alerting: Latency spikes, error rate \(>\) threshold, golden set failures
25 Summary
Evaluation Philosophy:
No single metric captures all aspects of model quality
Combine automated metrics (fast, scalable) with human eval (nuanced, expensive)
Test-time compute scaling increasingly important (CoT, self-consistency, best-of-N)
Contamination/leakage undermines benchmark validity–use dynamic evals
Key Trade-offs:
Correlation vs cost: Human eval gold standard but expensive; model-as-judge cheaper but biased
Static vs dynamic: Static benchmarks saturate; dynamic benchmarks harder to compare across time
In-distribution vs OOD: High performance on benchmarks \(\neq\) robust in production
Emerging Trends:
Test-time compute scaling laws (inference-time reasoning)
Agentic evaluation (models solve tasks via tool use, not just text generation)
Multimodal benchmarks (MMMU, video understanding)
Safety-first evaluation (red-teaming, jailbreak robustness, alignment)