14 Chapter 13: Reinforcement Learning
15 RL Foundations
15.1 Core Concepts
An MDP is defined by the tuple \((S, A, P, r, \gamma)\):
\(S\): State space – environment observations (LLMs: prompt/context)
\(A\): Action space – possible actions (LLMs: tokens/completions)
\(P(s'|s,a)\): Transition dynamics – next state distribution
\(r(s,a)\): Reward function – immediate reward for taking action \(a\) in state \(s\)
\(\gamma \in [0,1]\): Discount factor – weight for future rewards
A policy \(\pi_\theta(a|s)\) is a conditional probability distribution over actions given states, parameterized by \(\theta\).
Classical RL: typically a neural network mapping states to action probabilities
LLMs: the entire language model itself
A trajectory \(\tau\) is a sequence of states and actions: \[\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)\] In LLMs, a trajectory is a generated completion.
15.2 Value Functions
The value function \(V^\pi(s)\) is the expected return starting from state \(s\) under policy \(\pi\): \[V^\pi(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s\right]\]
The Q-function \(Q^\pi(s,a)\) is the expected return starting from state \(s\), taking action \(a\), then following \(\pi\): \[Q^\pi(s,a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s, a_0 = a\right]\]
The advantage function measures how much better an action is compared to the policy’s average: \[A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)\]
Intuition:
\(V^\pi(s)\): "How good is this state?"
\(Q^\pi(s,a)\): "How good is taking action \(a\) in state \(s\)?"
\(A^\pi(s,a)\): "How much better is action \(a\) than the average action?"
If \(A^\pi(s,a) > 0\), action \(a\) is better than average; if \(A^\pi(s,a) < 0\), it’s worse.
15.3 The Discount Factor \(\gamma\)
Why do we discount future rewards?
Convergence: For infinite-horizon problems, \(\sum_{t=0}^\infty r_t\) may diverge. Using \(\gamma < 1\) ensures \(\sum_{t=0}^\infty \gamma^t r_t\) converges.
Credit assignment: Encourages immediate rewards over distant ones – helps learning by focusing on near-term consequences.
Mathematical convenience: Makes the Bellman operator a contraction mapping, guaranteeing convergence of iterative algorithms.
LLM RLHF: Episodes are short (single completion), so often \(\gamma = 1\) (no discounting needed).
15.4 Objective: Expected Return
The RL objective is to maximize expected return: \[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \gamma^t r_t\right] = \mathbb{E}_{s \sim d^\pi, a \sim \pi_\theta}[Q^\pi(s,a)]\]
where \(d^\pi\) is the state distribution induced by policy \(\pi\).
16 Deep Q-Networks (DQN)
16.1 From Q-Learning to DQN
Classical Q-learning maintains a table \(Q(s,a)\) for discrete state-action spaces, updated via: \[Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]\]
This fails for large/continuous state spaces (e.g., Atari pixels). DQN (Mnih et al., 2015) approximates \(Q(s,a)\) with a neural network \(Q_\theta(s,a)\), enabling RL in high-dimensional environments.
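The tabular update above fits in a few lines of Python. This is a minimal illustration; the table sizes, \(\alpha\), \(\gamma\), and the toy transition are arbitrary choices, not part of any specific environment:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s,a) toward the TD target."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy table: 2 states x 2 actions, initialized to zero.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
# Q[0, 1] is now alpha * (1.0 + gamma * 0 - 0) = 0.1
```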
16.2 Core Innovations
1. Experience Replay: Store transitions \((s, a, r, s')\) in replay buffer \(\mathcal{D}\). Sample random mini-batches to train, breaking temporal correlations.
Why this matters: Sequential samples are highly correlated (same trajectory), causing unstable training. Random sampling makes data i.i.d.-like, stabilizing gradient updates.
2. Target Network: Maintain separate target network \(Q_{\theta^-}\) with frozen weights, updated periodically from \(Q_\theta\).
Loss function: \[\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q_{\theta^-}(s',a') - Q_\theta(s,a)\right)^2\right]\]
Why this matters: Without a target network, both sides of the Bellman update change simultaneously (chasing a moving target), causing oscillations. Freezing \(\theta^-\) for \(C\) steps (e.g., \(C=10{,}000\)) stabilizes training.
16.3 DQN Algorithm
Initialize replay buffer \(\mathcal{D}\) with capacity \(N\)
Initialize Q-network \(Q_\theta\) with random weights \(\theta\)
Initialize target network \(Q_{\theta^-}\) with \(\theta^- = \theta\)
Observe initial state \(s_0\)
For each step \(t\):
Select action \(a_t = \begin{cases} \text{random action} & \text{w.p. } \epsilon \\ \arg\max_a Q_\theta(s_t, a) & \text{otherwise} \end{cases}\)
Execute \(a_t\), observe reward \(r_t\) and next state \(s_{t+1}\)
Store transition \((s_t, a_t, r_t, s_{t+1})\) in \(\mathcal{D}\)
Sample a random mini-batch of transitions \((s_i, a_i, r_i, s_{i+1})\) from \(\mathcal{D}\)
Compute targets: \(y_i = r_i + \gamma \max_{a'} Q_{\theta^-}(s_{i+1}, a')\)
Perform a gradient descent step on \(\mathcal{L}(\theta) = (y_i - Q_\theta(s_i, a_i))^2\)
Every \(C\) steps: \(\theta^- \leftarrow \theta\)
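The target computation and loss from this loop can be sketched with plain arrays standing in for the two networks. This is an illustrative fragment, not a full DQN: the batch contents, sizes, and Q-values are made up, and real implementations replace the arrays with neural networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arrays stand in for Q_theta and Q_theta_minus over 5 states x 3 actions.
q_online = rng.normal(size=(5, 3))
q_target = q_online.copy()   # theta_minus = theta at initialization

# A mini-batch of transitions (s, a, r, s') sampled from the replay buffer.
s      = np.array([0, 1, 2])
a      = np.array([1, 0, 2])
r      = np.array([1.0, 0.0, 0.5])
s_next = np.array([3, 4, 0])
gamma  = 0.99

# Bellman targets come from the *frozen* target network.
y = r + gamma * q_target[s_next].max(axis=1)

# Squared TD loss on the online network's predictions Q_theta(s_i, a_i).
loss = np.mean((y - q_online[s, a]) ** 2)
# ...gradient step on q_online here; every C steps: q_target = q_online.copy()
```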
16.4 Extensions and Improvements
Double DQN (DDQN): Addresses overestimation bias in \(\max_{a'} Q(s',a')\).
Standard DQN uses the same (target) network to both select and evaluate the action: \[y = r + \gamma Q_{\theta^-}(s', \arg\max_{a'} Q_{\theta^-}(s',a'))\]
DDQN decouples selection (online network) from evaluation (target network): \[y = r + \gamma Q_{\theta^-}(s', \arg\max_{a'} Q_\theta(s',a'))\]
This reduces positive bias from max operator.
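A small numeric comparison makes the decoupling concrete. The Q-values below are invented to show what happens when the online network over-rates one action:

```python
import numpy as np

gamma, r = 0.99, 0.0
# Hypothetical next-state Q-values: the online net over-rates action 1.
q_online_next = np.array([1.0, 2.5, 2.0])
q_target_next = np.array([1.1, 1.9, 2.1])

# Vanilla DQN: the target net both selects and evaluates (a plain max).
y_dqn = r + gamma * q_target_next.max()        # picks action 2's value, 2.1

# Double DQN: the online net selects the action, the target net evaluates it.
a_star = int(np.argmax(q_online_next))         # action 1
y_ddqn = r + gamma * q_target_next[a_star]     # 1.9: a lower, less biased target
```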
Dueling DQN: Decomposes \(Q(s,a) = V(s) + A(s,a)\) where \(V(s)\) is state value and \(A(s,a)\) is advantage.
Architecture: Shared convolutional encoder → split into two streams → combine via \(Q(s,a) = V(s) + A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')\) (centering).
Why this helps: For many states, action choice doesn’t matter much. Learning \(V(s)\) separately improves sample efficiency.
Prioritized Experience Replay: Sample transitions with probability proportional to TD error \(|\delta| = |r + \gamma \max_{a'} Q(s',a') - Q(s,a)|\).
High-error transitions provide more learning signal. Corrects bias via importance sampling weights.
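The sampling probabilities and importance-sampling correction can be sketched as follows. The TD errors are made up, and the exponents \(\alpha = 0.6\), \(\beta = 0.4\) are illustrative hyperparameter choices:

```python
import numpy as np

td_errors = np.array([0.1, 2.0, 0.5, 0.01])   # |delta| for 4 stored transitions
alpha, beta = 0.6, 0.4                        # priority and IS-correction exponents

priorities = np.abs(td_errors) ** alpha
probs = priorities / priorities.sum()         # P(i) proportional to |delta_i|^alpha

# Importance-sampling weights undo the bias from non-uniform sampling:
# rarely-sampled (low-error) transitions get the largest weights.
N = len(td_errors)
weights = (N * probs) ** (-beta)
weights /= weights.max()                      # normalize so the max weight is 1
```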
16.5 DQN vs Policy Gradient Methods
| Aspect | DQN (Value-Based) | PPO (Policy-Based) |
|---|---|---|
| Action space | Discrete only | Continuous + discrete |
| Sample efficiency | Higher (off-policy) | Lower (on-policy) |
| Stability | Requires tricks (replay, target net) | More stable (clipping) |
| Exploration | \(\epsilon\)-greedy | Stochastic policy |
| LLM applicability | Poor (discrete tokens, but huge action space) | Excellent |
Why DQN rarely used for LLMs: Vocabulary size is 30K-100K tokens → \(Q(s,a)\) has 100K outputs per state. Softmax over Q-values works, but policy gradient methods (PPO) learn distributions directly, handling large action spaces more naturally.
Historical Impact: DQN achieved human-level Atari game performance from raw pixels (2015), launching the deep RL revolution. While less relevant for LLMs today, its innovations (experience replay, target networks) influenced later algorithms and remain foundational for understanding modern RL.
17 Policy Optimization: From TRPO to DPO
17.1 Foundation: Policy Gradient Theorem
The policy gradient theorem provides the foundation for all policy optimization methods.
The gradient of the expected return with respect to policy parameters \(\theta\) is: \[\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)\right]\] where \(Q^\pi(s,a)\) is the action-value function under policy \(\pi\).
Key Insight: We can improve the policy by moving in the direction that increases the log-probability of actions with high Q-values (good outcomes) and decreases the log-probability of actions with low Q-values (bad outcomes).
17.2 Why \(\nabla_\theta \log \pi_\theta\)?
This is the score function or log-derivative trick:
Derivation: \[\begin{align*} \nabla_\theta J(\theta) & = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] \\ & = \nabla_\theta \int \pi_\theta(\tau) R(\tau) \, d\tau \\ & = \int \nabla_\theta \pi_\theta(\tau) R(\tau) \, d\tau \\ & = \int \pi_\theta(\tau) \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} R(\tau) \, d\tau \\ & = \int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) R(\tau) \, d\tau \\ & = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(\tau) \cdot R(\tau)\right] \end{align*}\]
The key trick: \(\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta\) (from chain rule on logs).
Why this matters:
We can estimate gradients by sampling trajectories – no need to differentiate through the environment dynamics!
The \(\log \pi_\theta\) term makes gradients tractable for neural networks
This is the foundation of REINFORCE, TRPO, PPO, and ultimately DPO
Concrete Example: Why \(\log\)?
Suppose \(\pi_\theta(a|s) = \text{softmax}(f_\theta(s))_a = \frac{e^{f_\theta(s)_a}}{\sum_j e^{f_\theta(s)_j}}\).
Then: \[\log \pi_\theta(a|s) = f_\theta(s)_a - \log \sum_j e^{f_\theta(s)_j}\]
Taking gradient: \[\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta f_\theta(s)_a - \mathbb{E}_{a' \sim \pi_\theta}[\nabla_\theta f_\theta(s)_{a'}]\]
This is the difference between gradients for the chosen action and the expected gradient – exactly the "reinforce good actions, suppress bad actions" intuition!
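This identity is easy to verify numerically for a 3-action softmax policy. Here `logits` plays the role of \(f_\theta(s)\), the gradient is taken with respect to the logits, and the numbers are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([0.5, -0.2, 1.0])   # f_theta(s) for 3 actions
a = 2                                 # the chosen action
pi = softmax(logits)

# Analytic gradient of log pi(a) w.r.t. the logits:
# one-hot(a) minus pi (the expected-gradient term).
grad = np.eye(3)[a] - pi

# Central finite-difference check of the same gradient.
eps = 1e-6
num = np.zeros(3)
for j in range(3):
    lp = logits.copy(); lp[j] += eps
    lm = logits.copy(); lm[j] -= eps
    num[j] = (np.log(softmax(lp)[a]) - np.log(softmax(lm)[a])) / (2 * eps)
# grad and num agree up to finite-difference error
```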
17.3 Step 1: Vanilla Policy Gradient (REINFORCE)
The simplest approach uses the policy gradient directly: \[\theta_{t+1} = \theta_t + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t\]
Problems:
High variance in gradient estimates
Destructive updates: Large steps can catastrophically degrade performance
No constraint on how much policy changes between updates
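A minimal REINFORCE loop for a single-state, 2-action softmax policy shows both the update and the instability risk in miniature. The "environment" (action 0 always rewarded), the step size, and the iteration count are all invented for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
theta = np.zeros(2)    # logits of a 2-action policy (single-state "bandit")
alpha = 0.5

for _ in range(200):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)
    R = 1.0 if a == 0 else 0.0        # action 0 is rewarded, action 1 is not
    grad_log = np.eye(2)[a] - pi      # grad of log pi(a) w.r.t. the logits
    theta += alpha * grad_log * R     # REINFORCE ascent step

# softmax(theta)[0] is now close to 1: the rewarded action dominates.
```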
17.4 Step 2: Trust Region Policy Optimization (TRPO)
TRPO addresses destructive updates by constraining policy changes using KL divergence.
\[\max_\theta \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\text{old}}}(s,a)\right]\] subject to: \[\mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}\left[\text{KL}\left(\pi_{\theta_{\text{old}}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\right)\right] \leq \delta\]
Key Components:
Importance sampling ratio: \(r_t(\theta) = \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}\) reweights old trajectories
Advantage function: \(A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)\) measures how much better action \(a\) is than average
KL constraint: Ensures new policy \(\pi_\theta\) stays close to old policy \(\pi_{\theta_{\text{old}}}\)
Why KL Divergence? KL measures how much the distribution over actions changes; keeping it small prevents catastrophic policy collapse. To second order, the KL constraint induces the Fisher information metric on parameter space, which is what TRPO's natural-gradient step exploits.
Limitations:
Computationally expensive (requires solving constrained optimization with conjugate gradient)
Requires second-order derivatives (Hessian-vector products)
Difficult to implement correctly
17.5 Step 3: Proximal Policy Optimization (PPO)
PPO simplifies TRPO by replacing the hard KL constraint with a clipped objective.
\[L^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) A_t, \, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]\] where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) and \(\epsilon \approx 0.2\).
How Clipping Works:
If \(A_t > 0\) (good action): clip \(r_t\) to \([1, 1+\epsilon]\) – limit how much we increase its probability
If \(A_t < 0\) (bad action): clip \(r_t\) to \([1-\epsilon, 1]\) – limit how much we decrease its probability
Taking \(\min\) chooses the more conservative (pessimistic) objective
PPO Clipping Intuition:
Suppose \(\epsilon = 0.2\) and we have an action with \(A_t = +5\) (very good).
Unclipped: \(r_t \cdot 5\) could grow arbitrarily large if \(r_t\) is large
Clipped: \(\min(r_t \cdot 5, \, 1.2 \cdot 5) = 6\) once \(r_t > 1.2\)
Result: PPO stops increasing action probability once it’s 20% more likely than before
This prevents overshooting: even if an action looks great, we don’t want to make it too dominant.
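The per-sample clipped term is a one-liner; plugging in the numbers from the intuition above (the helper function is our own, not from any library):

```python
import numpy as np

def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

res1 = ppo_clip_term(1.1, 5.0)    # 5.5: ratio inside the clip range, unclipped
res2 = ppo_clip_term(1.5, 5.0)    # 6.0: capped at (1 + eps) * A = 1.2 * 5
res3 = ppo_clip_term(0.5, -3.0)   # clipped at (1 - eps) * A; pushing the
                                  # ratio lower earns no further objective gain
```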
PPO vs. TRPO:
TRPO: Hard KL constraint, requires second-order optimization
PPO: Soft constraint via clipping, first-order optimization only
PPO is simpler, faster, and often performs comparably to TRPO
17.6 Step 4: RLHF with Reward Models
For LLM alignment, we introduce Reinforcement Learning from Human Feedback (RLHF):
Collect preferences: Humans compare LLM outputs: "Response A \(\succ\) Response B"
Train reward model: Use the Bradley-Terry model to fit \(r_\phi(x, y)\): \[P(y_1 \succ y_2 \mid x) = \sigma(r_\phi(x, y_1) - r_\phi(x, y_2))\]
Optimize policy: Use PPO to maximize the predicted reward: \[\max_\theta \mathbb{E}_{x,\, y \sim \pi_\theta}\left[r_\phi(x, y)\right] - \beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})\]
The KL penalty \(\beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})\) prevents the policy from drifting too far from the reference model, avoiding reward hacking and maintaining language quality.
Bradley-Terry Connection: The reward model is trained as a pairwise comparator using logistic regression on preference data. See the Logistic Regression notes for detailed derivation.
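The reward-model loss is ordinary logistic regression on score margins. A sketch with made-up scores for three preference pairs (the function name is our own):

```python
import numpy as np

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood: -log sigma(r_w - r_l), averaged."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    # log1p(exp(-m)) is a numerically stable form of -log sigmoid(m).
    return float(np.mean(np.log1p(np.exp(-margin))))

# Reward-model scores for chosen vs rejected completions (illustrative).
loss = bt_loss([1.2, 0.3, 2.0], [0.1, 0.5, -1.0])
# Larger chosen-minus-rejected margins drive the loss toward 0.
```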
17.7 Step 5: Direct Preference Optimization (DPO)
DPO eliminates the reward model entirely by directly optimizing preferences!
The optimal policy under the RLHF objective has a closed form: \[\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r^*(x,y)\right)\]
Rearranging gives the implicit reward: \[r^*(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)\]
Key Idea: Instead of training a reward model \(r_\theta\) and then optimizing it with RL, we can directly parameterize the policy and optimize the Bradley-Terry loss!
Given preference data \((x, y_w, y_l)\) where \(y_w \succ y_l\): \[\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]\]
Derivation Sketch: \[\begin{align*} P(y_w \succ y_l \mid x) & = \sigma(r(x,y_w) - r(x,y_l)) \quad \text{(Bradley-Terry)} \\ & = \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right) \quad \text{(substitute implicit reward)} \end{align*}\]
We maximize this likelihood directly; no separate reward model or RL training loop is needed!
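Given per-sequence log-probabilities from the policy and the frozen reference, the DPO loss is a few lines. The log-probability values below are invented for illustration:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.log1p(np.exp(-logits)))   # stable -log sigmoid(logits)

# The policy favors y_w more (and y_l less) than the reference does, so the
# implicit reward margin is positive and the loss dips below log(2).
loss = dpo_loss(logp_w=-10.0, logp_l=-14.0,
                ref_logp_w=-12.0, ref_logp_l=-13.0)
```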
DPO Advantages:
Simpler: Single-stage training (no reward model, no PPO)
Stable: No reward hacking or value network training issues
Efficient: Standard supervised learning, easier to implement
Interpretable: Direct connection to Bradley-Terry preferences
17.8 Comparison: TRPO → PPO → RLHF → DPO
| Method | Constraint | Complexity | Reward Model? | RL Loop? |
|---|---|---|---|---|
| TRPO | Hard KL | High (2nd order) | Optional | Yes |
| PPO | Soft (clipping) | Medium (1st order) | Optional | Yes |
| RLHF+PPO | KL penalty | Medium | Yes | Yes |
| DPO | Implicit (via \(\pi_{\text{ref}}\)) | Low | No | No |
When to Use Each:
TRPO: Strong theoretical guarantees needed, computational cost acceptable
PPO: General-purpose RL, good balance of performance and simplicity
RLHF+PPO: Complex reward functions, need iterative refinement
DPO: High-quality preference data available, want simplicity and stability
17.9 Step 6: Group Relative Policy Optimization (GRPO / RLVR)
GRPO (Group Relative Policy Optimization) is a recent approach, typically paired with verifiable rewards (a setup often called RLVR, RL with Verifiable Rewards), that keeps PPO-style online policy-gradient updates while approaching DPO's implementation simplicity for LLM alignment.
Given a prompt \(x\):
Sample multiple completions: \(\{y_1, y_2, \ldots, y_K\} \sim \pi_\theta(\cdot|x)\)
Evaluate each with verifiable reward function: \(r(x, y_i)\)
Use group relative advantages to update policy
Key Idea: Instead of training a separate reward model (RLHF) or using only pairwise preferences (DPO), GRPO:
Uses a verifiable reward (e.g., code correctness, math verification, rule compliance)
Computes advantages relative to the group of sampled completions
Applies vanilla policy gradient with group baseline
For each prompt \(x\), sample \(K\) completions and compute: \[\nabla_\theta J(\theta) = \mathbb{E}_{x, \{y_i\}_{i=1}^K \sim \pi_\theta}\left[\sum_{i=1}^K \nabla_\theta \log \pi_\theta(y_i|x) \cdot A_{\text{group}}(x, y_i)\right]\] where the group advantage is: \[A_{\text{group}}(x, y_i) = r(x, y_i) - \frac{1}{K}\sum_{j=1}^K r(x, y_j)\]
Intuition:
The baseline \(\frac{1}{K}\sum_{j=1}^K r(x, y_j)\) is the average reward within the sampled group
If \(y_i\) is better than average in the group, \(A_{\text{group}}(x, y_i) > 0\) → increase its probability
If \(y_i\) is worse than average, \(A_{\text{group}}(x, y_i) < 0\) → decrease its probability
No value network is needed; the group itself provides the baseline!
GRPO Example: Code Generation
For prompt \(x\) = "Write Python function to sort a list":
Sample \(K=4\) completions from \(\pi_\theta\)
Run unit tests: \(r(x, y_1) = 1.0\) (passes), \(r(x, y_2) = 0.0\) (fails), \(r(x, y_3) = 1.0\), \(r(x, y_4) = 0.5\) (partial)
Group baseline: \(\bar{r} = \frac{1.0 + 0.0 + 1.0 + 0.5}{4} = 0.625\)
Advantages: \(A_1 = +0.375\), \(A_2 = -0.625\), \(A_3 = +0.375\), \(A_4 = -0.125\)
Update increases probability of \(y_1\) and \(y_3\) (correct solutions), decreases \(y_2\) and \(y_4\).
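The advantage computation is a one-liner; running it on the rewards above reproduces the numbers in the example. (Note that some GRPO implementations additionally divide by the group's reward standard deviation, which we omit here.)

```python
import numpy as np

def group_advantages(rewards):
    """GRPO advantage: each completion's reward minus the group mean."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

adv = group_advantages([1.0, 0.0, 1.0, 0.5])
# adv == [0.375, -0.625, 0.375, -0.125], matching the example
```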
Comparison to PPO and DPO:
vs. PPO: No value network, no clipping, simpler implementation. Group baseline replaces learned value function.
vs. DPO: Requires verifiable rewards (not just preferences). Can handle continuous reward signals, not limited to binary comparisons.
vs. RLHF: No separate reward model training phase. Reward must be computable (e.g., unit tests, rule checkers).
When to Use GRPO:
Verifiable rewards available: Code correctness, math proofs, fact-checking, rule compliance
Want simplicity: Easier than PPO (no value network), more flexible than DPO (handles continuous rewards)
Sample efficiency matters: Group baseline reduces variance without requiring large replay buffers
Online learning: Can update immediately after sampling, no offline preference collection needed
Implementation Details:
Typical group size: \(K = 4\) to \(16\) completions per prompt
Can add KL penalty to reference model: \(J = \mathbb{E}[\sum A_{\text{group}} \log \pi_\theta] - \beta \text{KL}(\pi_\theta \| \pi_{\text{ref}})\)
Often combined with rejection sampling: only update on prompts where at least one completion succeeds
Scales well with GPU parallelism (sample multiple completions in parallel)
17.10 Updated Comparison: TRPO → PPO → RLHF → DPO → GRPO
| Method | Constraint | Reward | Value Net? | Complexity |
|---|---|---|---|---|
| TRPO | Hard KL | Any | Optional | High (2nd order) |
| PPO | Soft (clipping) | Any | Yes | Medium |
| RLHF+PPO | KL penalty | Learned (BT) | Yes | Medium |
| DPO | Implicit (via \(\pi_{\text{ref}}\)) | Preferences | No | Low |
| GRPO | Optional KL | Verifiable | No | Low |
Summary:
TRPO/PPO: General RL with any reward function, requires value network
RLHF: Learns reward from preferences, full RL loop with PPO
DPO: Bypasses reward model and RL loop, direct preference optimization
GRPO: Bypasses value network, uses group baseline with verifiable rewards
18 Example: Invoice Extraction
18.1 Problem Setup
Goal: Fine-tune an open-source document understanding model (e.g., LayoutLMv3, Donut) to extract structured data from invoices using reinforcement learning.
Why RL for Invoice Extraction?
Reward easier to specify than exhaustive labels (business rules)
Can optimize complex objectives: accuracy + confidence + hallucination reduction
Handles sparse feedback: overall extraction quality vs. per-field annotations
18.2 MDP Formulation
Define the Markov Decision Process as:
State \(s\): Invoice image + OCR text + layout features
Action \(a\): Generate structured extraction (JSON with fields: vendor, date, total, line items)
Reward \(r(s,a)\): Composite score measuring extraction quality
Transition: Deterministic (single-step episode per invoice)
18.3 Reward Function Design
The reward function combines multiple components:
\[\begin{align} R(a, y^*) & = w_1 \cdot R_{\text{field}}(a, y^*) + w_2 \cdot R_{\text{struct}}(a) \notag \\ & \quad + w_3 \cdot R_{\text{format}}(a) - w_4 \cdot R_{\text{halluc}}(a) \end{align}\]
where:
\(R_{\text{field}}\): F1 score for each extracted field (vendor, date, amounts)
\(R_{\text{struct}}\): Structural validation (line items sum to subtotal, tax calculations)
\(R_{\text{format}}\): Format compliance (date formats, currency, decimals)
\(R_{\text{halluc}}\): Penalty for hallucinated or missing required fields
Example Reward Weights: \[R = 0.4 \cdot F1_{\text{fields}} + 0.3 \cdot \text{Validation}_{\text{struct}} + 0.2 \cdot \text{Compliance}_{\text{fmt}} - 0.1 \cdot \text{Penalty}_{\text{halluc}}\]
For an invoice with:
Field extraction F1 = 0.9 (correct vendor, date, total)
Structural validation = 1.0 (subtotal + tax = total)
Format compliance = 0.8 (minor date format issue)
No hallucinations = 0
Total reward: \(R = 0.4(0.9) + 0.3(1.0) + 0.2(0.8) - 0.1(0) = 0.82\)
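The composite reward is simple to compute; a sketch using the example weights (the function name and signature are our own, not from any extraction library):

```python
def invoice_reward(f1_fields, struct_valid, fmt_compliance, halluc_penalty,
                   w=(0.4, 0.3, 0.2, 0.1)):
    """Weighted composite reward; the default weights are the example's."""
    w1, w2, w3, w4 = w
    return (w1 * f1_fields + w2 * struct_valid
            + w3 * fmt_compliance - w4 * halluc_penalty)

reward = invoice_reward(0.9, 1.0, 0.8, 0.0)   # approx 0.82, as worked out above
```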
18.4 RL Algorithm: PPO
Proximal Policy Optimization (PPO) is well-suited for this task:
\[\begin{align} L^{\text{CLIP}}(\theta) & = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right] \\ r_t(\theta) & = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \\ \hat{A}_t & = R_t - V_\phi(s_t) \end{align}\]
where \(\epsilon = 0.2\) (clip range), \(V_\phi\) is value network, and \(\hat{A}_t\) is advantage estimate.
Understanding the Value Network \(V_\phi\):
\(V_\phi(s_t)\) is a critic trained online alongside the policy; it is not a pre-trained reward model.
Architecture: Shares transformer backbone with policy \(\pi_\theta\), but with a scalar output head (vs. token logits for policy)
Training objective: \(L_{\text{value}}(\phi) = \mathbb{E}[(V_\phi(s_t) - R_t)^2]\) – learns to predict expected return
Purpose: Provides baseline for advantage \(\hat{A}_t = R_t - V_\phi(s_t)\), reducing variance while keeping gradients unbiased
Key Distinction:
| Component | Role | When Trained |
|---|---|---|
| Reward Model \(R\) | Scores completions (human preferences) | Offline (before RL) |
| Value Network \(V_\phi\) | Estimates expected future reward | Online (during RL) |
| Policy \(\pi_\theta\) | Generates tokens | Online (during RL) |
The advantage \(\hat{A}_t\) tells us: “How much better/worse was this action than expected?” rather than using raw returns which have high variance.
Key PPO Parameters:
Learning rate: \(1 \times 10^{-5}\) (small for stability)
Batch size: 8 invoices per update
Epochs per batch: 3
Clip range \(\epsilon\): 0.2
Value coefficient: 0.5, Entropy coefficient: 0.01
18.5 Alternative: Direct Preference Optimization (DPO)
DPO simplifies training by using preference pairs without explicit reward model:
\[\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]\]
where \(y_w\) is preferred extraction, \(y_l\) is rejected extraction, \(\beta\) controls strength.
Preference Pair Construction:
For invoice with ground truth: {vendor: "Acme Corp", total: $500.00}
Preferred: {vendor: "Acme Corp", total: 500.00}
Rejected: {vendor: "Acme", total: 50.00} (partial name, OCR error)
DPO learns to prefer complete, accurate extractions over common errors.
18.6 Implementation Considerations
Model Selection: LayoutLMv3 (vision+text), Donut (end-to-end), Pix2Struct
Training Strategy: Start with supervised fine-tuning baseline, then apply RL
RL Library: TRL (Transformer Reinforcement Learning), OpenRLHF
Infrastructure: PyTorch + DeepSpeed for distributed training
Evaluation: Per-field F1, end-to-end success rate, business rule compliance
Challenges:
Reward hacking: Model exploits gaps in reward function
Sample efficiency: RL needs many iterations; use LoRA for faster experiments
Distribution shift: Test robustness on new invoice formats
19 Interview Questions
Q1: What’s the difference between the reward model and the value network in PPO?
A:
Reward Model \(R\): Trained offline on human preferences, scores completions (e.g., 0.8 for helpful, 0.3 for harmful)
Value Network \(V_\phi\): Trained online during RL, estimates expected future reward from state
Purpose: \(V_\phi\) provides baseline for advantage \(\hat{A}_t = R_t - V_\phi(s_t)\), reducing variance
Architecture: Both can share transformer backbone, different heads (scalar for value, logits for policy)
Q2: Why does PPO use the clipped objective instead of vanilla policy gradient?
A:
Problem: Large policy updates can cause performance collapse
Solution: Clip ratio \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) to \([1-\epsilon, 1+\epsilon]\) (typically \(\epsilon=0.2\))
Effect: Limits how much policy can change per update, ensuring stable training
vs TRPO: PPO simpler to implement (no conjugate gradient), similar performance
Q3: How does RLHF differ from supervised fine-tuning (SFT)?
A:
SFT: Learn from demonstrations via cross-entropy loss \(\mathcal{L} = -\log p(y|x)\)
RLHF: Optimize reward from human feedback via policy gradient
Advantage: RLHF can learn beyond demonstrations (exploration), optimize non-differentiable objectives (safety, helpfulness)
Standard pipeline: SFT first (warm start), then RLHF for alignment
Q4: What is DPO and why is it simpler than PPO?
A:
DPO (Direct Preference Optimization): Bypasses reward model, optimizes directly on preference pairs
Loss: \(\mathcal{L} = -\mathbb{E}[\log \sigma(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)})]\)
Benefit: No reward model training, no value network, simpler pipeline
Trade-off: Less flexible (requires pairwise data), may underperform PPO on complex tasks
Q5: How do you prevent reward hacking in RLHF?
A:
KL penalty: Add \(-\beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})\) to the reward to keep the policy close to the reference
Diverse reward models: Train ensemble and use min/average to avoid exploiting single model
Iterative refinement: Collect more human feedback on policy outputs, retrain reward model
Rule-based constraints: Hard penalties for undesired behaviors (toxicity, hallucination)
Q6: Why use LoRA for RLHF instead of full fine-tuning?
A:
Memory: RLHF with PPO keeps multiple models resident (policy, frozen reference, and typically value and reward models); LoRA reduces the trainable footprint 10-100×
Sample efficiency: LoRA converges faster, enabling more experiments per GPU-hour
Multi-task: Can train separate LoRA adapters per task without reloading base model
Stability: Freezing base weights reduces catastrophic forgetting risk
Q7: What’s the three-step RLHF pipeline for ChatGPT/Claude?
A:
SFT: Supervised fine-tune on high-quality demonstrations (instruction-following)
Reward Modeling: Train reward model on human preference pairs (A vs B comparisons)
PPO: Optimize policy to maximize reward while staying close to SFT model (KL penalty)
20 Chapter Summary
Core RL Concepts: Reinforcement learning frames LLM training as a Markov Decision Process where states are prompts/contexts, actions are token selections, and rewards come from human feedback. The policy \(\pi_\theta\) (the LLM itself) learns to maximize expected cumulative reward \(\mathbb{E}[\sum \gamma^t r_t]\) through iterative policy updates.
RLHF Pipeline: The standard three-step process starts with supervised fine-tuning on demonstrations, trains a reward model on human preference pairs using Bradley-Terry comparison, then optimizes the policy via PPO with KL-divergence regularization to prevent drift from the reference model. This produces aligned models (ChatGPT, Claude) that follow instructions while avoiding harmful outputs.
PPO Mechanics: Proximal Policy Optimization uses a clipped objective \(\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t, 1\pm\epsilon)\hat{A}_t)\) to limit policy updates, preventing catastrophic performance collapse. The value network \(V_\phi\) trains online to predict expected returns, providing a baseline for advantage estimation \(\hat{A}_t = R_t - V_\phi(s_t)\) that reduces gradient variance.
DPO Alternative: Direct Preference Optimization bypasses the reward model entirely, optimizing directly on preference pairs via a reparameterization trick. While simpler (no reward model, no value network), DPO requires pairwise comparison data and may underperform PPO on complex tasks requiring fine-grained reward shaping.
Practical Considerations: Production RLHF faces reward hacking (model exploits reward function gaps), sample inefficiency (requires many rollouts), and distribution shift (test behavior differs from training). Solutions include KL penalties to constrain exploration, ensemble reward models to prevent exploitation, LoRA for memory efficiency, and iterative human-in-the-loop refinement.
Key Takeaways:
RLHF enables optimizing non-differentiable objectives (safety, helpfulness) beyond supervised learning
Reward model quality determines the alignment ceiling; invest in diverse, high-quality preference data
KL penalty \(\beta\) balances reward optimization vs stability: too high keeps the policy pinned near the reference, too low invites reward hacking and mode collapse
Start with SFT for warm start, use LoRA for sample efficiency, monitor reward hacking continuously
DPO works well for simple alignment (harmlessness), PPO better for complex objectives (reasoning + safety)