16  Chapter 15: Evaluation & Training

17 Introduction

Evaluation is the foundation of AI development–validating model capabilities, guiding training, and ensuring safety. This chapter covers:

  • Metrics First: How we measure language, vision, and RL–definitions, use cases, pitfalls

  • Must-Know Datasets: Core benchmarks practitioners encounter in interviews and production

  • Modern Paradigms: Test-time compute, model-as-judge, contamination

  • Production Eval: Golden sets, red-teaming, A/B testing, observability

Philosophy: No single metric captures quality. Combine automated metrics (fast, scalable) with human eval (nuanced, expensive). Always validate on held-out data and monitor for distribution drift.

Historical Context–Benchmark Breakthroughs:

  • 2012: AlexNet on ImageNet – 15.3% top-5 error (vs 26% previous SOTA), launched deep learning era

  • 2015: ResNet – 3.6% error on ImageNet (superhuman), residual connections enabled 100+ layer networks

  • 2017: Transformer – Attention mechanism replaced RNNs; BLEU 28.4 on WMT’14 En-De translation

  • 2018: BERT – Pre-training + fine-tuning paradigm; 93.2 F1 on SQuAD v1.1 (above human parity of ~91)

  • 2019: GPT-2 – 1.5B params, zero-shot text generation; LAMBADA ppl 8.6 (vs 99.8 prior SOTA)

  • 2020: GPT-3 – 175B params, few-shot learning; 43.9 on MMLU (random chance = 25%)

  • 2022: ChatGPT/GPT-3.5 – RLHF alignment; conversational AI goes mainstream

  • 2023: GPT-4 – Multimodal, 86.4% on MMLU, 67% on HumanEval; contamination concerns emerge

  • 2024-25: Test-time compute scaling – o1 model with chain-of-thought; 83% on AIME (olympiad qualifying exam)

18 Metrics for Language Models

18.1 Classification Metrics Primer

Before diving into language-specific metrics, recall fundamental classification metrics:

Confusion Matrix:

                    Predicted
                 Positive   Negative
Actual Positive     TP         FN
       Negative     FP         TN

Core Metrics: \[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\] \[\text{Precision (PPV)} = \frac{TP}{TP + FP} \quad \text{(Of predicted positives, how many are correct?)}\] \[\text{Recall (Sensitivity, TPR)} = \frac{TP}{TP + FN} \quad \text{(Of actual positives, how many did we find?)}\] \[\text{Specificity (TNR)} = \frac{TN}{TN + FP} \quad \text{(Of actual negatives, how many did we correctly reject?)}\] \[\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}\]

F-beta Score: Generalized F1 with adjustable precision/recall weight: \[F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}\]

  • \(\beta < 1\): Favor precision (e.g., \(F_{0.5}\) for spam detection–minimize false alarms)

  • \(\beta = 1\): Balanced (F1)

  • \(\beta > 1\): Favor recall (e.g., \(F_2\) for medical screening–minimize missed cases)

Key Trade-offs:

  • Precision vs Recall: Precision-recall curve; high precision → low false positives, high recall → low false negatives

  • Accuracy misleading with imbalance: 99% negative class → always predict negative gives 99% accuracy but useless

  • ROC-AUC: Area under ROC curve (TPR vs FPR); measures ranking quality across thresholds

  • PR-AUC: Area under precision-recall curve; better for imbalanced datasets
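The definitions above, including the F-beta generalization, fit in a short pure-Python sketch (an illustration of the formulas, not a substitute for a metrics library):

```python
def classification_metrics(tp, tn, fp, fn, beta=1.0):
    """Core classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # F-beta: beta < 1 favors precision, beta > 1 favors recall
    b2 = beta ** 2
    f_beta = ((1 + b2) * precision * recall / (b2 * precision + recall)
              if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f_beta": f_beta}

# Accuracy can look fine while recall collapses under class imbalance:
m = classification_metrics(tp=0, tn=990, fp=0, fn=10)
assert m["accuracy"] == 0.99 and m["recall"] == 0.0
```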

18.2 Perplexity

Definition: Exponentiated average negative log-likelihood: \[\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i | x_{<i})\right)\]

Interpretation:

  • Measures how "surprised" the model is by the next token

  • Lower = better (model assigns higher probability to ground truth)

  • PPL of 10 means model is as confused as if choosing uniformly from 10 tokens

Use Cases:

  • Pre-training validation metric (Wikitext, Penn Treebank)

  • Comparing language models on held-out data

  • Detecting out-of-distribution text

Limitations:

  • Doesn’t measure generation quality (factuality, coherence, helpfulness)

  • Can be gamed by memorization

  • Sensitive to tokenization (different tokenizers → incomparable PPL)

  • Critical: Low perplexity \(\neq\) good generation
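The formula is direct to implement once you have per-token log-probabilities from a model; a minimal sketch:

```python
from math import exp, log

def perplexity(token_logprobs):
    # PPL = exp(-mean of log p(x_i | x_<i)) over the evaluated tokens
    return exp(-sum(token_logprobs) / len(token_logprobs))

# A model that always assigns p = 0.1 to the true token is "choosing
# uniformly among 10 tokens": PPL = 10
assert abs(perplexity([log(0.1)] * 5) - 10.0) < 1e-9
```

Note that the log-probabilities are per *token*, which is why PPL values from models with different tokenizers are not comparable.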

18.3 BLEU (BiLingual Evaluation Understudy)

Definition: Precision-based n-gram overlap with reference(s): \[\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{4} \frac{1}{4} \log p_n\right)\] where:

  • \(p_n\): Modified n-gram precision (clipping to avoid reward for repetition)

  • BP: Brevity penalty \(= \min(1, e^{1 - r/c})\), \(r\) = reference length, \(c\) = candidate length

Note

Modified n-gram precision (clipping): Prevents rewarding repetitive text.

Problem: Without clipping, "the the the the" scores 100% precision if "the" appears in reference.

Solution: Count each n-gram in candidate up to its max occurrences in any reference. \[p_n = \frac{\sum_{\text{n-gram}} \min(\text{Count}_{\text{cand}}, \max_{\text{ref}} \text{Count}_{\text{ref}})}{\sum_{\text{n-gram}} \text{Count}_{\text{cand}}}\]

Example:

  • Reference: "the cat is on the mat" (2 occurrences of "the")

  • Candidate: "the the the the the the" (6 occurrences)

  • Without clipping: precision = 6/6 = 100%

  • With clipping: count = min(6, 2) = 2 → precision = 2/6 = 33%
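The clipping rule from the example can be reproduced in a few lines (a sketch of modified n-gram precision only, not full BLEU with brevity penalty):

```python
from collections import Counter

def clipped_precision(candidate, references, n=1):
    # Candidate n-gram counts are clipped at the maximum count of that
    # n-gram in any single reference
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate)
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

# The example above: clipping caps "the" at 2 occurrences -> 2/6
score = clipped_precision("the the the the the the".split(),
                          ["the cat is on the mat".split()])
assert abs(score - 2 / 6) < 1e-9
```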

Range: 0 to 1 (often scaled to 0-100)

Use Cases:

  • Machine translation (original use case)

  • Image captioning, code generation

Limitations:

  • Only measures precision (not recall)–misses coverage

  • Lexical matching (synonyms treated as wrong)

  • Poor correlation with human judgment for long-form text

  • Requires reference translations

18.4 ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Variants:

  • ROUGE-N: N-gram recall (ROUGE-1, ROUGE-2) \[\text{ROUGE-N} = \frac{\sum_{\text{n-gram} \in \text{ref}} \text{Count}_{\text{match}}(\text{n-gram})}{\sum_{\text{n-gram} \in \text{ref}} \text{Count}(\text{n-gram})}\]

  • ROUGE-L: Longest common subsequence (fluency/order)

Key Difference from BLEU: ROUGE measures recall (how much of reference appears in generation), BLEU measures precision (how much of generation appears in reference).

Use Cases:

  • Summarization (CNN/DailyMail, XSum)

  • Dialogue response evaluation

Limitations:

  • Recall-focused (doesn’t penalize hallucinations)

  • Lexical overlap only (no semantics)
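ROUGE-N recall can be sketched the same way as BLEU's precision, just normalized by the reference instead of the candidate:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    # Recall: fraction of reference n-grams covered by the candidate
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(cand[g], c) for g, c in ref.items())
    return overlap / sum(ref.values())

# 2 of the 3 reference unigrams are covered
assert abs(rouge_n("the cat".split(), "the cat sat".split()) - 2 / 3) < 1e-9
```

Because the denominator is the reference, a candidate stuffed with extra (hallucinated) n-grams is not penalized, which is exactly the limitation noted above.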

18.5 BERTScore

Definition: Semantic similarity via contextual embeddings. The recall variant matches each reference token to its closest candidate token: \[R_{\text{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{y_j \in y} \text{cosine}(\mathbf{h}_{x_i}, \mathbf{h}_{y_j})\] where \(x\) is the reference, \(y\) the candidate, and \(\mathbf{h}\) are BERT (or similar) embeddings. Precision \(P_{\text{BERT}}\) averages over candidate tokens instead, and \(\text{BERTScore}_{\text{F1}}\) is the harmonic mean of the two.

Advantages:

  • Captures semantic similarity (synonyms scored correctly)

  • Better human correlation than BLEU/ROUGE

Limitations:

  • Computationally expensive

  • Sensitive to embedding model choice
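For intuition, the greedy max-cosine matching can be shown with stand-in vectors (in practice \(\mathbf{h}\) comes from a pretrained encoder; the toy embeddings here are purely illustrative):

```python
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def greedy_match(ref_embs, cand_embs):
    # Recall direction: each reference token takes its best candidate match
    return (sum(max(cosine(r, c) for c in cand_embs) for r in ref_embs)
            / len(ref_embs))

# Identical token embeddings -> perfect score
assert abs(greedy_match([[1.0, 0.0], [0.0, 1.0]],
                        [[1.0, 0.0], [0.0, 1.0]]) - 1.0) < 1e-9
```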

18.6 Human Evaluation

Common Criteria:

  • Fluency: Grammatical correctness, naturalness

  • Coherence: Logical flow, topic consistency

  • Relevance: Addresses the prompt?

  • Factuality: Accurate claims?

  • Helpfulness: Satisfies user intent?

  • Harmlessness: Avoids toxic/offensive content

Annotation Protocols:

  • Likert scales (1-5 or 1-7)

  • Pairwise comparisons (A vs B)

  • Elo ratings (aggregated pairwise wins)

Challenges:

  • Expensive and slow

  • Annotator disagreement (inter-rater reliability)

  • Subjective criteria (cultural biases)
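The Elo aggregation of pairwise comparisons works like chess ratings; a minimal sketch of a single update:

```python
def elo_update(r_a, r_b, score_a, k=32):
    # score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Evenly matched models: a win moves each rating by k/2 = 16 points
assert elo_update(1000, 1000, 1.0) == (1016.0, 984.0)
```

An upset (a low-rated model beating a high-rated one) moves the ratings by more than 16 points, which is how the system converges from noisy pairwise votes.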

19 Key LLM Benchmarks

19.1 Reasoning & Math

GSM8K (Grade School Math):

  • 8.5K grade-school math word problems

  • Metric: Exact match on final numerical answer

  • Multi-step reasoning; benefits from chain-of-thought prompting

  • Key insight: Test-time compute scaling improves accuracy

  • SOTA progression: GPT-3 (17%) → GPT-3.5 (57%) → GPT-4 (92%) → o1 (95%+)

MATH:

  • Competition-level math (AMC, AIME)

  • 5 difficulty levels, 7 subjects

  • SOTA: GPT-3 (5%) → Minerva (50%) → GPT-4 (52%) → o1 (85%+)

  • Human olympiad contestants: ~90%

HumanEval:

  • 164 Python programming problems with unit tests

  • Metric: pass@k (probability that at least one of \(k\) samples passes the unit tests)

  • Tests code synthesis and functional correctness

  • SOTA pass@1: Codex (29%) → GPT-3.5 (48%) → GPT-4 (67%) → Claude 3.5 (92%)
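The HumanEval paper's unbiased pass@k estimator, given \(n\) samples of which \(c\) pass, is simple to implement:

```python
from math import comb

def pass_at_k(n, c, k):
    # 1 - C(n-c, k) / C(n, k): probability that a random size-k subset
    # of the n samples contains at least one passing sample
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

assert pass_at_k(n=10, c=0, k=5) == 0.0    # nothing passes
assert pass_at_k(n=10, c=10, k=1) == 1.0   # everything passes
assert abs(pass_at_k(n=2, c=1, k=1) - 0.5) < 1e-9
```

Computing the estimator over many samples (e.g. \(n = 200\)) is far more stable than literally drawing \(k\) samples and checking.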

19.2 Knowledge & Understanding

MMLU (Massive Multitask Language Understanding):

  • 57 subjects (STEM, humanities, social sciences, professional)

  • 15K multiple-choice questions (high school to professional)

  • Tests breadth of knowledge and zero-shot reasoning

  • Gold standard for LLM capability comparison

  • SOTA: Random (25%) → GPT-3 (43%) → GPT-3.5 (70%) → GPT-4 (86%) → Gemini Ultra (90%)

TruthfulQA:

  • 817 questions eliciting common misconceptions

  • Tests hallucination tendency

  • Metric: % truthful and informative answers

  • Challenge: Larger base models are often less truthful (they reproduce common misconceptions more fluently); RLHF helps (GPT-3: 28%, GPT-4: 59%)

19.3 Long-Context

RULER (Needle-in-Haystack):

  • Tests retrieval from contexts up to 128K tokens

  • Place "needle" at different depths, measure recall

  • Exposes position bias (models worse at middle)

20 Vision-Language Model (VLM) Metrics & Benchmarks

20.1 Metrics for Vision

20.1.1 FID (Fréchet Inception Distance)

Definition: Distance between feature distributions of real vs generated images: \[\text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})\] where \(\mu, \Sigma\) are mean/covariance of Inception-v3 features.

Use Cases:

  • GANs, diffusion models

  • Lower FID = better quality/diversity

Limitations:

  • Requires 10K+ images for stability

  • Inception bias (ImageNet features)

  • Doesn’t capture fine-grained artifacts
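For intuition, the FID formula collapses nicely in one dimension, where the trace terms become scalars and the matrix square root becomes \(\sqrt{\sigma_r^2 \sigma_g^2}\); a sketch of that special case:

```python
def fid_1d(mu_r, var_r, mu_g, var_g):
    # 1-D special case of the Frechet distance between two Gaussians
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2 * (var_r * var_g) ** 0.5

assert fid_1d(0.0, 1.0, 0.0, 1.0) == 0.0   # identical distributions
assert fid_1d(2.0, 1.0, 0.0, 1.0) == 4.0   # mean shift only
```

The real metric does the same thing over 2048-dimensional Inception-v3 features, which is why it needs a matrix square root and many samples to estimate the covariances stably.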

20.1.2 CLIP Score

Definition: Cosine similarity between CLIP embeddings: \[\text{CLIPScore}(I, c) = \text{cosine}(\text{CLIP}_{\text{img}}(I), \text{CLIP}_{\text{text}}(c))\]

Use Cases:

  • Text-to-image generation (prompt adherence)

  • Image captioning (semantic alignment)

Advantage: Reference-free (no ground truth needed)

Limitation: CLIP biases from web data

20.1.3 CIDEr & SPICE

CIDEr: TF-IDF weighted n-gram overlap (emphasizes descriptive terms)

SPICE: Scene graph F1 (parses captions into objects/attributes/relationships)

  • Better semantic capture than n-gram metrics

  • Standard for COCO Captions benchmark

20.2 Key VLM Benchmarks

COCO Captions:

  • 330K images, 5 captions each

  • Metrics: BLEU-4, CIDEr, SPICE

  • Breakthrough: Show and Tell (2015, CIDEr 94) → CLIP-based models (2021, CIDEr 140+)

VQA v2:

  • 1M+ questions on 200K images

  • Balanced to reduce language bias

  • SOTA: BERT-based (2019, 71%) → CLIP+GPT (2021, 78%) → GPT-4V (2023, 77%)

MMMU:

  • College-level problems across 30+ subjects

  • Diagrams, charts, scientific figures

  • SOTA: Random (26%) → GPT-4V (56%) → Gemini 1.5 Pro (62%), Human expert: 89%

21 Reinforcement Learning Evaluation

21.1 RL Metrics

21.1.1 Return (Cumulative Reward)

Definition: \[G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}\] where \(\gamma\) is discount factor (typically 0.99).

Aggregation:

  • Mean return over evaluation episodes

  • Median (robust to outliers)

  • Interquartile mean (IQM): mean of middle 50%

21.1.2 Human-Normalized Score

Definition: \[\text{Score} = \frac{R_{\text{agent}} - R_{\text{random}}}{R_{\text{human}} - R_{\text{random}}}\]

Interpretation:

  • 0 = random policy, 1 = human-level, \(>1\) = superhuman

  • Allows cross-game comparison (Atari benchmarks)
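The three quantities from this subsection can be sketched directly (a minimal illustration of the formulas):

```python
def discounted_return(rewards, gamma=0.99):
    # G_t computed backwards: G = r + gamma * G_next
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def iqm(scores):
    # Interquartile mean: average of the middle 50% of episode scores
    xs = sorted(scores)
    q = len(xs) // 4
    mid = xs[q:len(xs) - q]
    return sum(mid) / len(mid)

def human_normalized(r_agent, r_random, r_human):
    return (r_agent - r_random) / (r_human - r_random)

assert abs(discounted_return([1, 1, 1], gamma=0.5) - 1.75) < 1e-9
assert abs(iqm([1, 2, 3, 4, 5, 6, 7, 8]) - 4.5) < 1e-9
assert human_normalized(50.0, 0.0, 100.0) == 0.5  # halfway to human-level
```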

21.2 Key RL Benchmarks

Atari 2600:

  • 57 games (Pong, Breakout, Space Invaders)

  • Standard for deep RL (DQN, Rainbow, PPO)

  • Atari 100K: Sample efficiency (100K environment steps)

  • Milestones: DQN (2015, human-level on many games) → Rainbow (2017, ~230% human) → MuZero (2020, 350%+)

MuJoCo:

  • Continuous control: Hopper, Walker2d, Ant, Humanoid

  • Tests continuous control algorithms (SAC, TD3, PPO)

  • Breakthroughs: TRPO (2015) → PPO (2017) → SAC (2018, SOTA on most tasks)

DMControl Suite:

  • 30+ tasks, pixel-based and state-based

  • More diverse than MuJoCo

22 Modern Evaluation Paradigms

22.1 Test-Time Compute Scaling

Core Idea: Allocate more inference compute to improve performance (alternative to scaling model size).

Techniques:

  • Best-of-N: Generate \(N\) samples, select best (by reward model or heuristic)

  • Chain-of-Thought (CoT): Prompt for reasoning steps before answer

  • Self-Consistency: Multiple CoT paths with majority vote

  • Tree-of-Thoughts: Explore multiple reasoning branches, backtrack

Impact: 5-50% accuracy gain on reasoning tasks (GSM8K, MATH)

Trade-offs:

  • Latency: More compute → slower responses

  • Cost: Each sample costs API calls or GPU time

  • Diminishing returns: Gains plateau after \(N \sim 100\)
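Self-consistency in particular is just a majority vote over sampled answers; a minimal sketch, where `sample_answer` is a hypothetical callable standing in for one temperature-sampled chain-of-thought rollout:

```python
from collections import Counter

def self_consistency(sample_answer, prompt, n=16):
    # sample_answer: hypothetical callable (prompt -> final answer string)
    votes = Counter(sample_answer(prompt) for _ in range(n))
    answer, count = votes.most_common(1)[0]
    return answer, count / n  # majority answer and its vote share

# Stub sampler: 3 of 4 rollouts agree on "4"
rollouts = iter(["4", "4", "5", "4"])
answer, share = self_consistency(lambda p: next(rollouts), "2 + 2 = ?", n=4)
assert answer == "4" and share == 0.75
```

The vote share doubles as a cheap confidence signal: low agreement across rollouts often flags questions the model gets wrong.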

22.2 Model-as-Judge

Setup: Use LLM (e.g., GPT-4) to score outputs instead of human eval.

Pairwise Comparison:

  • Present outputs A vs B, ask which is better

  • Aggregate via Elo ratings or win rate

Challenges:

  • Position bias: Favors first option

  • Self-preference: Favors own outputs

  • Length bias: Favors longer responses

Mitigation:

  • Swap order (A/B → B/A), aggregate

  • Use third-party judge
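The swap-order mitigation can be sketched as follows, where `judge` is a hypothetical callable returning "A" or "B" for the better output:

```python
def judged_winner(judge, prompt, out_a, out_b):
    v1 = judge(prompt, out_a, out_b)           # original order
    v2 = judge(prompt, out_b, out_a)           # swapped order
    v2 = {"A": "B", "B": "A"}[v2]              # map verdict back to original labels
    return v1 if v1 == v2 else "tie"           # disagreement suggests position bias

# A judge that always picks the first option yields only ties
assert judged_winner(lambda p, a, b: "A", "q", "x", "y") == "tie"
# A judge with a consistent preference survives the swap
prefers_longer = lambda p, a, b: "A" if len(a) > len(b) else "B"
assert judged_winner(prefers_longer, "q", "longer answer", "short") == "A"
```

Note the second judge also illustrates length bias: consistency under swapping rules out position bias but not other biases.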

Benchmarks:

  • Chatbot Arena: User battles with Elo ratings (100K+ battles)

  • MT-Bench: 80 multi-turn questions, GPT-4 judge (correlates r > 0.9 with human)

22.3 Contamination & Data Leakage

Problem: Training data contains test examples → inflated performance.

Detection:

  • N-gram overlap between train and test

  • Embedding similarity thresholds

Mitigation:

  • Held-out test sets (never released)

  • Dynamic benchmarks (generated on-the-fly)

  • Document decontamination in model cards
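A minimal sketch of n-gram overlap detection (real decontamination pipelines hash n-grams over the full corpus rather than holding sets in memory):

```python
def ngram_overlap(train_text, test_text, n=8):
    # Fraction of test-set n-grams that also occur in the training corpus;
    # high overlap is a contamination red flag
    def ngrams(text):
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    test = ngrams(test_text)
    if not test:
        return 0.0
    return len(test & ngrams(train_text)) / len(test)

doc = "the quick brown fox jumps over the lazy dog today"
assert ngram_overlap(doc, doc) == 1.0  # verbatim leak
```

Choosing `n` trades sensitivity for false positives: short n-grams flag common phrases, long ones only catch verbatim copies.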

23 Production Evaluation

23.1 Golden Sets & Test Suites

Composition:

  • Regression tests: Previous bugs that must not reoccur

  • Adversarial examples: Known failure modes (prompt injections)

  • Edge cases: Multilingual, code-switching

  • Domain-specific: Medical disclaimers, PII redaction

Best Practice:

  • 100-1K curated examples

  • CI/CD integration: block deploys if accuracy drops

  • Continuously update with production failures
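A CI/CD gate over the golden set can be as simple as an accuracy threshold check; a minimal sketch (in practice the results would come from a test runner, not a hard-coded list):

```python
def gate_deploy(results, threshold=0.95):
    # results: list of (example_id, passed) pairs from the golden set run
    accuracy = sum(passed for _, passed in results) / len(results)
    return accuracy >= threshold, accuracy

ok, acc = gate_deploy([("g1", True), ("g2", True), ("g3", False)],
                      threshold=0.95)
assert not ok  # one regression out of three examples blocks the deploy
```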

23.2 Red-Teaming

Goal: Adversarial testing for safety/security vulnerabilities.

Approaches:

  • Manual: Security experts craft adversarial prompts

  • Automated: RL agents trained to elicit harmful outputs

  • Crowdsourced: Platform users paid to break model

Attack Vectors:

  • Jailbreaks (bypass safety guardrails)

  • Prompt injections (ignore system instructions)

  • Toxicity elicitation

  • PII extraction

Metric: Attack success rate (ASR)

23.3 A/B Testing & Online Evaluation

Setup: Deploy model variant to subset of users, measure real-world impact.

Metrics:

  • Engagement: Click-through rate, session length, retention

  • Task success: Completion rate, user satisfaction

  • Guardrail violations: Toxicity reports, user blocks

Statistical Rigor:

  • Minimum detectable effect (MDE)

  • Multiple testing correction (Bonferroni)

  • Run 1-2 weeks (avoid novelty effect)
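For a binary success metric, the standard significance check is a two-proportion z-test; a self-contained sketch (a real analysis would also pre-compute the sample size from the MDE):

```python
from math import erf, sqrt

def two_proportion_z(success_a, n_a, success_b, n_b):
    # Two-sided z-test for a difference in success/satisfaction rates
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)          # pooled rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal CDF tail
    return z, p_value

# 50% vs 70% task success on 100 users each: significant at alpha = 0.05
z, p = two_proportion_z(50, 100, 70, 100)
assert z > 1.96 and p < 0.05
```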

23.4 Observability & Monitoring

Real-Time Metrics:

  • Latency: P50, P95, P99

  • Throughput: Requests/sec, tokens/sec

  • Error rate: 5xx, timeouts, OOM

Quality Metrics:

  • Output length distribution

  • Refusal rate

  • Toxicity classifier scores
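Tail latency percentiles are easy to compute in batch; a nearest-rank sketch (production systems usually use streaming sketches such as t-digest or HDR histograms instead of sorting raw samples):

```python
from math import ceil

def percentile(samples, q):
    # Nearest-rank percentile for q in (0, 100]
    xs = sorted(samples)
    rank = max(1, ceil(q * len(xs) / 100))
    return xs[rank - 1]

latencies_ms = list(range(1, 101))  # toy latency samples: 1..100 ms
assert percentile(latencies_ms, 50) == 50   # P50
assert percentile(latencies_ms, 95) == 95   # P95
assert percentile(latencies_ms, 99) == 99   # P99
```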

24 Interview Cheat Sheet

24.1 Key Datasets to Know

LLM Text:

  • Reasoning: GSM8K (grade-school math), MATH (competition math), HumanEval (code)

  • Knowledge: MMLU (57 subjects, multiple choice)

  • QA: SQuAD, Natural Questions, TriviaQA

  • Safety: TruthfulQA (hallucinations), ToxiGen (toxicity), BBQ (bias)

  • Long-context: RULER (needle-in-haystack), LongBench

VLM:

  • Captioning: COCO Captions, Nocaps

  • VQA: VQA v2, GQA (reasoning), TextVQA (OCR)

  • Multimodal: MMMU (college-level), MMBench

RL:

  • Games: Atari 2600, Procgen

  • Control: MuJoCo (Hopper, Walker, Ant), DMControl

  • Multi-task: Meta-World (50 robotic tasks)

Note: Metrics Summary

Metric       Use Case                   Key Property
Perplexity   Language modeling          Exponentiated NLL
BLEU         Translation, captioning    Precision-based n-gram overlap
ROUGE        Summarization              Recall-based n-gram overlap
BERTScore    Semantic similarity        Contextual embedding matching
CIDEr        Image captioning           TF-IDF weighted n-grams
SPICE        Caption semantic content   Scene graph F1
FID          Image generation quality   Fréchet distance of Inception features
CLIP Score   Text-image alignment       CLIP embedding similarity
Return       RL episode performance     Cumulative discounted reward
Human-norm   RL cross-game comparison   Scaled by random/human baselines

24.2 Common Interview Questions

“What’s the difference between BLEU and ROUGE?”
BLEU measures precision (how much of generation appears in reference), ROUGE measures recall (how much of reference appears in generation). BLEU penalizes brevity; ROUGE is more forgiving. BLEU for translation, ROUGE for summarization.

“Why is perplexity not enough to evaluate LLMs?”
Perplexity measures surprisal on next-token prediction, not generation quality. A model can have low perplexity (good at predicting) but generate factually incorrect or unhelpful text. Doesn’t capture coherence, factuality, or alignment.

“How do you evaluate a text-to-image model?”
Quantitative: FID (distribution match), CLIP Score (prompt adherence), Inception Score (quality+diversity). Qualitative: Human ratings on photorealism, prompt fidelity, diversity. A/B testing in production.

“What is test-time compute scaling?”
Allocating more inference compute (e.g., best-of-N sampling, chain-of-thought, self-consistency) to improve performance. Alternative to scaling model size. Effective for reasoning tasks (GSM8K, MATH) but increases latency/cost.

“How do you detect benchmark contamination?”
N-gram overlap between train and test, embedding similarity, substring matching. Mitigation: use held-out test sets, dynamic benchmarks (generated on-demand), report decontamination procedures in model cards.

“What is model-as-judge, and what are its limitations?”
Use LLM (e.g., GPT-4) to score/rank outputs instead of human eval. Faster and cheaper. Limitations: position bias (favors first option), self-preference (favors own outputs), length bias (favors longer), may miss nuanced errors.

“Describe red-teaming for LLMs.”
Adversarial testing to find safety/security vulnerabilities. Manual (experts craft jailbreaks), automated (RL agents trained to elicit harmful outputs), crowdsourced. Metrics: attack success rate, time-to-break. Critical for pre-deployment safety validation.

“How do you evaluate long-context models?”
Needle-in-haystack (RULER): place fact at different depths in 128K context, measure recall. Position bias analysis (models worse at middle). Long-form QA/summarization benchmarks (LongBench). Check for context length extrapolation failures.

24.3 Production Best Practices

Golden Sets:

  • 100-1K curated examples covering critical failure modes

  • Regression tests + adversarial + edge cases + domain-specific

  • CI/CD integration: block deploys if golden set accuracy drops

  • Continuously update with production failures

A/B Testing:

  • 5-10% traffic to new model, measure engagement/satisfaction

  • Statistical significance: MDE, multiple testing correction

  • Monitor for Simpson’s paradox (subgroup effects)

  • Run for 1-2 weeks (avoid novelty effect bias)

Note

Simpson’s Paradox in A/B Testing

A trend can hold in every subgroup but reverse in aggregate (or vice versa). Example:

Old Model:

  • Medical queries (100 users): 90 satisfied → 90% satisfaction

  • General queries (900 users): 450 satisfied → 50% satisfaction

  • Overall: 540/1000 = 54% satisfaction

New Model:

  • Medical queries (900 users): 720 satisfied → 80% satisfaction (worse than 90%)

  • General queries (100 users): 40 satisfied → 40% satisfaction (worse than 50%)

  • Overall: 760/1000 = 76% satisfaction (better!)

How? Subgroup sizes flipped–new model received more traffic from high-performing category (medical), even though it’s worse at both tasks. Aggregate metric misleads.

Lesson: Always segment by user type, query category, language. Overall metrics can hide regressions in critical subpopulations.
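The numbers in the example above can be checked in a few lines:

```python
def satisfaction(groups):
    # groups: {segment: (satisfied_users, total_users)}
    per_segment = {seg: s / t for seg, (s, t) in groups.items()}
    overall = (sum(s for s, _ in groups.values())
               / sum(t for _, t in groups.values()))
    return per_segment, overall

old = {"medical": (90, 100), "general": (450, 900)}
new = {"medical": (720, 900), "general": (40, 100)}

seg_old, overall_old = satisfaction(old)  # {0.9, 0.5}, 0.54
seg_new, overall_new = satisfaction(new)  # {0.8, 0.4}, 0.76

assert all(seg_new[k] < seg_old[k] for k in old)  # worse in every segment...
assert overall_new > overall_old                  # ...yet better overall
```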

Observability:

  • Real-time: P95/P99 latency, error rate, throughput

  • Quality: Output length, refusal rate, toxicity scores

  • Drift: Input/output distribution shifts, golden set accuracy

  • Alerting: Latency spikes, error rate \(>\) threshold, golden set failures

25 Summary

Evaluation Philosophy:

  • No single metric captures all aspects of model quality

  • Combine automated metrics (fast, scalable) with human eval (nuanced, expensive)

  • Test-time compute scaling increasingly important (CoT, self-consistency, best-of-N)

  • Contamination/leakage undermines benchmark validity–use dynamic evals

Key Trade-offs:

  • Correlation vs cost: Human eval gold standard but expensive; model-as-judge cheaper but biased

  • Static vs dynamic: Static benchmarks saturate; dynamic benchmarks harder to compare across time

  • In-distribution vs OOD: High performance on benchmarks \(\neq\) robust in production

Emerging Trends:

  • Test-time compute scaling laws (inference-time reasoning)

  • Agentic evaluation (models solve tasks via tool use, not just text generation)

  • Multimodal benchmarks (MMMU, video understanding)

  • Safety-first evaluation (red-teaming, jailbreak robustness, alignment)