14 Chapter 13: Reinforcement Learning
15 RL Foundations
15.1 Core Concepts
An MDP is defined by the tuple \((S, A, P, r, \gamma)\):
\(S\): State space – environment observations (LLMs: prompt/context)
\(A\): Action space – possible actions (LLMs: tokens/completions)
\(P(s'|s,a)\): Transition dynamics – next state distribution
\(r(s,a)\): Reward function – immediate reward for taking action \(a\) in state \(s\)
\(\gamma \in [0,1]\): Discount factor – weight for future rewards
A policy \(\pi_\theta(a|s)\) is a conditional probability distribution over actions given states, parameterized by \(\theta\).
Classical RL: typically a neural network mapping states to action probabilities
LLMs: the entire language model itself
A trajectory \(\tau\) is a sequence of states and actions: \[\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)\] In LLMs, a trajectory is a generated completion.
15.2 Value Functions
The value function \(V^\pi(s)\) is the expected return starting from state \(s\) under policy \(\pi\): \[V^\pi(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s\right]\]
The Q-function \(Q^\pi(s,a)\) is the expected return starting from state \(s\), taking action \(a\), then following \(\pi\): \[Q^\pi(s,a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s, a_0 = a\right]\]
The advantage function measures how much better an action is compared to the policy’s average: \[A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)\]
Intuition:
\(V^\pi(s)\): "How good is this state?"
\(Q^\pi(s,a)\): "How good is taking action \(a\) in state \(s\)?"
\(A^\pi(s,a)\): "How much better is action \(a\) than the average action?"
If \(A^\pi(s,a) > 0\), action \(a\) is better than average; if \(A^\pi(s,a) < 0\), it’s worse.
15.3 The Discount Factor \(\gamma\)
Why do we discount future rewards?
Convergence: For infinite-horizon problems, \(\sum_{t=0}^\infty r_t\) may diverge. Using \(\gamma < 1\) ensures \(\sum_{t=0}^\infty \gamma^t r_t\) converges.
Credit assignment: Encourages immediate rewards over distant ones – helps learning by focusing on near-term consequences.
Mathematical convenience: Makes the Bellman operator a contraction mapping, guaranteeing convergence of iterative algorithms.
LLM RLHF: Episodes are short (single completion), so often \(\gamma = 1\) (no discounting needed).
15.4 Objective: Expected Return
The RL objective is to maximize expected return: \[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \gamma^t r_t\right] = \mathbb{E}_{s \sim d^\pi, a \sim \pi_\theta}[Q^\pi(s,a)]\]
where \(d^\pi\) is the state distribution induced by policy \(\pi\).
16 Deep Q-Networks (DQN)
16.1 From Q-Learning to DQN
Classical Q-learning maintains a table \(Q(s,a)\) for discrete state-action spaces, updated via: \[Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]\]
This fails for large/continuous state spaces (e.g., Atari pixels). DQN (Mnih et al., 2015) approximates \(Q(s,a)\) with a neural network \(Q_\theta(s,a)\), enabling RL in high-dimensional environments.
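The tabular update above fits in a few lines of Python. This is a minimal illustration; the table sizes, \(\alpha\), \(\gamma\), and the toy transition are arbitrary choices, not part of any specific environment:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s,a) toward the TD target."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy table: 2 states x 2 actions, initialized to zero.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
# Q[0, 1] is now alpha * (1.0 + gamma * 0 - 0) = 0.1
```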
16.2 Core Innovations
1. Experience Replay: Store transitions \((s, a, r, s')\) in replay buffer \(\mathcal{D}\). Sample random mini-batches to train, breaking temporal correlations.
Why this matters: Sequential samples are highly correlated (same trajectory), causing unstable training. Random sampling makes data i.i.d.-like, stabilizing gradient updates.
2. Target Network: Maintain separate target network \(Q_{\theta^-}\) with frozen weights, updated periodically from \(Q_\theta\).
Loss function: \[\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q_{\theta^-}(s',a') - Q_\theta(s,a)\right)^2\right]\]
Why this matters: Without a target network, both sides of the Bellman update change simultaneously (chasing a moving target), causing oscillations. Freezing \(\theta^-\) for \(C\) steps (e.g., \(C=10{,}000\)) stabilizes training.
16.3 DQN Algorithm
Initialize replay buffer \(\mathcal{D}\) with capacity \(N\)
Initialize Q-network \(Q_\theta\) with random weights \(\theta\)
Initialize target network \(Q_{\theta^-}\) with \(\theta^- = \theta\)
Observe initial state \(s_0\)
For each step \(t\):
Select action \(a_t = \begin{cases} \text{random action} & \text{w.p. } \epsilon \\ \arg\max_a Q_\theta(s_t, a) & \text{otherwise} \end{cases}\)
Execute \(a_t\), observe reward \(r_t\) and next state \(s_{t+1}\)
Store transition \((s_t, a_t, r_t, s_{t+1})\) in \(\mathcal{D}\)
Sample a random mini-batch of transitions \((s_i, a_i, r_i, s_{i+1})\) from \(\mathcal{D}\)
Compute targets: \(y_i = r_i + \gamma \max_{a'} Q_{\theta^-}(s_{i+1}, a')\)
Perform a gradient descent step on \(\mathcal{L}(\theta) = (y_i - Q_\theta(s_i, a_i))^2\)
Every \(C\) steps: \(\theta^- \leftarrow \theta\)
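The target computation and loss from this loop can be sketched with plain arrays standing in for the two networks. This is an illustrative fragment, not a full DQN: the batch contents, sizes, and Q-values are made up, and real implementations replace the arrays with neural networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arrays stand in for Q_theta and Q_theta_minus over 5 states x 3 actions.
q_online = rng.normal(size=(5, 3))
q_target = q_online.copy()   # theta_minus = theta at initialization

# A mini-batch of transitions (s, a, r, s') sampled from the replay buffer.
s      = np.array([0, 1, 2])
a      = np.array([1, 0, 2])
r      = np.array([1.0, 0.0, 0.5])
s_next = np.array([3, 4, 0])
gamma  = 0.99

# Bellman targets come from the *frozen* target network.
y = r + gamma * q_target[s_next].max(axis=1)

# Squared TD loss on the online network's predictions Q_theta(s_i, a_i).
loss = np.mean((y - q_online[s, a]) ** 2)
# ...gradient step on q_online here; every C steps: q_target = q_online.copy()
```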
16.4 Extensions and Improvements
Double DQN (DDQN): Addresses overestimation bias in \(\max_{a'} Q(s',a')\).
Standard DQN uses the same (target) network to both select and evaluate the action: \[y = r + \gamma Q_{\theta^-}(s', \arg\max_{a'} Q_{\theta^-}(s',a'))\]
DDQN decouples selection (online network) from evaluation (target network): \[y = r + \gamma Q_{\theta^-}(s', \arg\max_{a'} Q_\theta(s',a'))\]
This reduces positive bias from max operator.
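A small numeric comparison makes the decoupling concrete. The Q-values below are invented to show what happens when the online network over-rates one action:

```python
import numpy as np

gamma, r = 0.99, 0.0
# Hypothetical next-state Q-values: the online net over-rates action 1.
q_online_next = np.array([1.0, 2.5, 2.0])
q_target_next = np.array([1.1, 1.9, 2.1])

# Vanilla DQN: the target net both selects and evaluates (a plain max).
y_dqn = r + gamma * q_target_next.max()        # picks action 2's value, 2.1

# Double DQN: the online net selects the action, the target net evaluates it.
a_star = int(np.argmax(q_online_next))         # action 1
y_ddqn = r + gamma * q_target_next[a_star]     # 1.9: a lower, less biased target
```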
Dueling DQN: Decomposes \(Q(s,a) = V(s) + A(s,a)\) where \(V(s)\) is state value and \(A(s,a)\) is advantage.
Architecture: Shared convolutional encoder → split into two streams → combine via \(Q(s,a) = V(s) + A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')\) (centering).
Why this helps: For many states, action choice doesn’t matter much. Learning \(V(s)\) separately improves sample efficiency.
Prioritized Experience Replay: Sample transitions with probability proportional to TD error \(|\delta| = |r + \gamma \max_{a'} Q(s',a') - Q(s,a)|\).
High-error transitions provide more learning signal. Corrects bias via importance sampling weights.
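The sampling probabilities and importance-sampling correction can be sketched as follows. The TD errors are made up, and the exponents \(\alpha = 0.6\), \(\beta = 0.4\) are illustrative hyperparameter choices:

```python
import numpy as np

td_errors = np.array([0.1, 2.0, 0.5, 0.01])   # |delta| for 4 stored transitions
alpha, beta = 0.6, 0.4                        # priority and IS-correction exponents

priorities = np.abs(td_errors) ** alpha
probs = priorities / priorities.sum()         # P(i) proportional to |delta_i|^alpha

# Importance-sampling weights undo the bias from non-uniform sampling:
# rarely-sampled (low-error) transitions get the largest weights.
N = len(td_errors)
weights = (N * probs) ** (-beta)
weights /= weights.max()                      # normalize so the max weight is 1
```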
16.5 DQN vs Policy Gradient Methods
| Aspect | DQN (Value-Based) | PPO (Policy-Based) |
|---|---|---|
| Action space | Discrete only | Continuous + discrete |
| Sample efficiency | Higher (off-policy) | Lower (on-policy) |
| Stability | Requires tricks (replay, target net) | More stable (clipping) |
| Exploration | \(\epsilon\)-greedy | Stochastic policy |
| LLM applicability | Poor (discrete tokens, but huge action space) | Excellent |
Why DQN rarely used for LLMs: Vocabulary size is 30K-100K tokens → \(Q(s,a)\) has 100K outputs per state. Softmax over Q-values works, but policy gradient methods (PPO) learn distributions directly, handling large action spaces more naturally.
Historical Impact: DQN achieved human-level Atari game performance from raw pixels (2015), launching the deep RL revolution. While less relevant for LLMs today, its innovations (experience replay, target networks) influenced later algorithms and remain foundational for understanding modern RL.
17 Policy Optimization: From TRPO to DPO
17.1 Foundation: Policy Gradient Theorem
The policy gradient theorem provides the foundation for all policy optimization methods.
The gradient of the expected return with respect to policy parameters \(\theta\) is: \[\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)\right]\] where \(Q^\pi(s,a)\) is the action-value function under policy \(\pi\).
Key Insight: We can improve the policy by moving in the direction that increases the log-probability of actions with high Q-values (good outcomes) and decreases the log-probability of actions with low Q-values (bad outcomes).
17.2 Why \(\nabla_\theta \log \pi_\theta\)?
This is the score function or log-derivative trick:
Derivation: \[\begin{align*} \nabla_\theta J(\theta) & = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] \\ & = \nabla_\theta \int \pi_\theta(\tau) R(\tau) \, d\tau \\ & = \int \nabla_\theta \pi_\theta(\tau) R(\tau) \, d\tau \\ & = \int \pi_\theta(\tau) \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} R(\tau) \, d\tau \\ & = \int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) R(\tau) \, d\tau \\ & = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(\tau) \cdot R(\tau)\right] \end{align*}\]
The key trick: \(\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta\) (from chain rule on logs).
Why this matters:
We can estimate gradients by sampling trajectories – no need to differentiate through the environment dynamics!
The \(\log \pi_\theta\) term makes gradients tractable for neural networks
This is the foundation of REINFORCE, TRPO, PPO, and ultimately DPO
Concrete Example: Why \(\log\)?
Suppose \(\pi_\theta(a|s) = \text{softmax}(f_\theta(s))_a = \frac{e^{f_\theta(s)_a}}{\sum_j e^{f_\theta(s)_j}}\).
Then: \[\log \pi_\theta(a|s) = f_\theta(s)_a - \log \sum_j e^{f_\theta(s)_j}\]
Taking gradient: \[\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta f_\theta(s)_a - \mathbb{E}_{a' \sim \pi_\theta}[\nabla_\theta f_\theta(s)_{a'}]\]
This is the difference between gradients for the chosen action and the expected gradient – exactly the "reinforce good actions, suppress bad actions" intuition!
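This identity is easy to verify numerically for a 3-action softmax policy. Here `logits` plays the role of \(f_\theta(s)\), the gradient is taken with respect to the logits, and the numbers are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([0.5, -0.2, 1.0])   # f_theta(s) for 3 actions
a = 2                                 # the chosen action
pi = softmax(logits)

# Analytic gradient of log pi(a) w.r.t. the logits:
# one-hot(a) minus pi (the expected-gradient term).
grad = np.eye(3)[a] - pi

# Central finite-difference check of the same gradient.
eps = 1e-6
num = np.zeros(3)
for j in range(3):
    lp = logits.copy(); lp[j] += eps
    lm = logits.copy(); lm[j] -= eps
    num[j] = (np.log(softmax(lp)[a]) - np.log(softmax(lm)[a])) / (2 * eps)
# grad and num agree up to finite-difference error
```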
17.3 Step 1: Vanilla Policy Gradient (REINFORCE)
The simplest approach uses the policy gradient directly: \[\theta_{t+1} = \theta_t + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t\]
Problems:
High variance in gradient estimates
Destructive updates: Large steps can catastrophically degrade performance
No constraint on how much policy changes between updates
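A minimal REINFORCE loop for a single-state, 2-action softmax policy shows both the update and the instability risk in miniature. The "environment" (action 0 always rewarded), the step size, and the iteration count are all invented for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
theta = np.zeros(2)    # logits of a 2-action policy (single-state "bandit")
alpha = 0.5

for _ in range(200):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)
    R = 1.0 if a == 0 else 0.0        # action 0 is rewarded, action 1 is not
    grad_log = np.eye(2)[a] - pi      # grad of log pi(a) w.r.t. the logits
    theta += alpha * grad_log * R     # REINFORCE ascent step

# softmax(theta)[0] is now close to 1: the rewarded action dominates.
```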
17.4 Step 2: Trust Region Policy Optimization (TRPO)
TRPO addresses destructive updates by constraining policy changes using KL divergence.
\[\max_\theta \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\text{old}}}(s,a)\right]\] subject to: \[\mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}\left[\text{KL}\left(\pi_{\theta_{\text{old}}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\right)\right] \leq \delta\]
Key Components:
Importance sampling ratio: \(r_t(\theta) = \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}\) reweights old trajectories
Advantage function: \(A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)\) measures how much better action \(a\) is than average
KL constraint: Ensures new policy \(\pi_\theta\) stays close to old policy \(\pi_{\theta_{\text{old}}}\)
Why KL Divergence? KL measures how much the distribution over actions changes; keeping it small prevents catastrophic policy collapse. To second order, the KL constraint induces the Fisher information metric on parameter space, which is what TRPO's natural-gradient step exploits.
Limitations:
Computationally expensive (requires solving constrained optimization with conjugate gradient)
Requires second-order derivatives (Hessian-vector products)
Difficult to implement correctly
17.5 Step 3: Proximal Policy Optimization (PPO)
PPO simplifies TRPO by replacing the hard KL constraint with a clipped objective.
\[L^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) A_t, \, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]\] where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) and \(\epsilon \approx 0.2\).
How Clipping Works:
If \(A_t > 0\) (good action): clip \(r_t\) to \([1, 1+\epsilon]\) – limit how much we increase its probability
If \(A_t < 0\) (bad action): clip \(r_t\) to \([1-\epsilon, 1]\) – limit how much we decrease its probability
Taking \(\min\) chooses the more conservative (pessimistic) objective
PPO Clipping Intuition:
Suppose \(\epsilon = 0.2\) and we have an action with \(A_t = +5\) (very good).
Unclipped: \(r_t \cdot 5\) could grow arbitrarily large if \(r_t\) is large
Clipped: \(\min(r_t \cdot 5, \, 1.2 \cdot 5) = 6\) once \(r_t > 1.2\)
Result: PPO stops increasing action probability once it’s 20% more likely than before
This prevents overshooting: even if an action looks great, we don’t want to make it too dominant.
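The per-sample clipped term is a one-liner; plugging in the numbers from the intuition above (the helper function is our own, not from any library):

```python
import numpy as np

def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

res1 = ppo_clip_term(1.1, 5.0)    # 5.5: ratio inside the clip range, unclipped
res2 = ppo_clip_term(1.5, 5.0)    # 6.0: capped at (1 + eps) * A = 1.2 * 5
res3 = ppo_clip_term(0.5, -3.0)   # clipped at (1 - eps) * A; pushing the
                                  # ratio lower earns no further objective gain
```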
PPO vs. TRPO:
TRPO: Hard KL constraint, requires second-order optimization
PPO: Soft constraint via clipping, first-order optimization only
PPO is simpler, faster, and often performs comparably to TRPO
17.6 Step 4: RLHF with Reward Models
For LLM alignment, we introduce Reinforcement Learning from Human Feedback (RLHF):
Collect preferences: Humans compare LLM outputs: "Response A \(\succ\) Response B"
Train reward model: Use the Bradley-Terry model to fit \(r_\phi(x, y)\): \[P(y_1 \succ y_2 \mid x) = \sigma(r_\phi(x, y_1) - r_\phi(x, y_2))\]
Optimize policy: Use PPO to maximize the predicted reward: \[\max_\theta \mathbb{E}_{x,\, y \sim \pi_\theta}\left[r_\phi(x, y)\right] - \beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})\]
The KL penalty \(\beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})\) prevents the policy from drifting too far from the reference model, avoiding reward hacking and maintaining language quality.
Bradley-Terry Connection: The reward model is trained as a pairwise comparator using logistic regression on preference data. See the Logistic Regression notes for detailed derivation.
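The reward-model loss is ordinary logistic regression on score margins. A sketch with made-up scores for three preference pairs (the function name is our own):

```python
import numpy as np

def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood: -log sigma(r_w - r_l), averaged."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    # log1p(exp(-m)) is a numerically stable form of -log sigmoid(m).
    return float(np.mean(np.log1p(np.exp(-margin))))

# Reward-model scores for chosen vs rejected completions (illustrative).
loss = bt_loss([1.2, 0.3, 2.0], [0.1, 0.5, -1.0])
# Larger chosen-minus-rejected margins drive the loss toward 0.
```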
17.7 Step 5: Direct Preference Optimization (DPO)
DPO eliminates the reward model entirely by directly optimizing preferences!
The optimal policy under the RLHF objective has a closed form: \[\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r^*(x,y)\right)\]
Rearranging gives the implicit reward: \[r^*(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)\]
Key Idea: Instead of training a reward model \(r_\theta\) and then optimizing it with RL, we can directly parameterize the policy and optimize the Bradley-Terry loss!
Given preference data \((x, y_w, y_l)\) where \(y_w \succ y_l\): \[\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]\]
Derivation Sketch: \[\begin{align*} P(y_w \succ y_l \mid x) & = \sigma(r(x,y_w) - r(x,y_l)) \quad \text{(Bradley-Terry)} \\ & = \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right) \quad \text{(substitute implicit reward)} \end{align*}\]
We maximize this likelihood directly; no separate reward model or RL training loop is needed!
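Given per-sequence log-probabilities from the policy and the frozen reference, the DPO loss is a few lines. The log-probability values below are invented for illustration:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.log1p(np.exp(-logits)))   # stable -log sigmoid(logits)

# The policy favors y_w more (and y_l less) than the reference does, so the
# implicit reward margin is positive and the loss dips below log(2).
loss = dpo_loss(logp_w=-10.0, logp_l=-14.0,
                ref_logp_w=-12.0, ref_logp_l=-13.0)
```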
DPO Advantages:
Simpler: Single-stage training (no reward model, no PPO)
Stable: No reward hacking or value network training issues
Efficient: Standard supervised learning, easier to implement
Interpretable: Direct connection to Bradley-Terry preferences
17.8 Comparison: TRPO → PPO → RLHF → DPO
| Method | Constraint | Complexity | Reward Model? | RL Loop? |
|---|---|---|---|---|
| TRPO | Hard KL | High (2nd order) | Optional | Yes |
| PPO | Soft (clipping) | Medium (1st order) | Optional | Yes |
| RLHF+PPO | KL penalty | Medium | Yes | Yes |
| DPO | Implicit (via \(\pi_{\text{ref}}\)) | Low | No | No |
When to Use Each:
TRPO: Strong theoretical guarantees needed, computational cost acceptable
PPO: General-purpose RL, good balance of performance and simplicity
RLHF+PPO: Complex reward functions, need iterative refinement
DPO: High-quality preference data available, want simplicity and stability
17.9 Step 6: Group Relative Policy Optimization (GRPO / RLVR)
GRPO (Group Relative Policy Optimization) is a recent approach, typically paired with verifiable rewards (a setup often called RLVR, RL with Verifiable Rewards), that keeps PPO-style online policy-gradient updates while approaching DPO's implementation simplicity for LLM alignment.
Given a prompt \(x\):
Sample multiple completions: \(\{y_1, y_2, \ldots, y_K\} \sim \pi_\theta(\cdot|x)\)
Evaluate each with verifiable reward function: \(r(x, y_i)\)
Use group relative advantages to update policy
Key Idea: Instead of training a separate reward model (RLHF) or using only pairwise preferences (DPO), GRPO:
Uses a verifiable reward (e.g., code correctness, math verification, rule compliance)
Computes advantages relative to the group of sampled completions
Applies vanilla policy gradient with group baseline
For each prompt \(x\), sample \(K\) completions and compute: \[\nabla_\theta J(\theta) = \mathbb{E}_{x, \{y_i\}_{i=1}^K \sim \pi_\theta}\left[\sum_{i=1}^K \nabla_\theta \log \pi_\theta(y_i|x) \cdot A_{\text{group}}(x, y_i)\right]\] where the group advantage is: \[A_{\text{group}}(x, y_i) = r(x, y_i) - \frac{1}{K}\sum_{j=1}^K r(x, y_j)\]
Intuition:
The baseline \(\frac{1}{K}\sum_{j=1}^K r(x, y_j)\) is the average reward within the sampled group
If \(y_i\) is better than average in the group, \(A_{\text{group}}(x, y_i) > 0\) → increase its probability
If \(y_i\) is worse than average, \(A_{\text{group}}(x, y_i) < 0\) → decrease its probability
No value network is needed; the group itself provides the baseline!
GRPO Example: Code Generation
For prompt \(x\) = "Write Python function to sort a list":
Sample \(K=4\) completions from \(\pi_\theta\)
Run unit tests: \(r(x, y_1) = 1.0\) (passes), \(r(x, y_2) = 0.0\) (fails), \(r(x, y_3) = 1.0\), \(r(x, y_4) = 0.5\) (partial)
Group baseline: \(\bar{r} = \frac{1.0 + 0.0 + 1.0 + 0.5}{4} = 0.625\)
Advantages: \(A_1 = +0.375\), \(A_2 = -0.625\), \(A_3 = +0.375\), \(A_4 = -0.125\)
Update increases probability of \(y_1\) and \(y_3\) (correct solutions), decreases \(y_2\) and \(y_4\).
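The advantage computation is a one-liner; running it on the rewards above reproduces the numbers in the example. (Note that some GRPO implementations additionally divide by the group's reward standard deviation, which we omit here.)

```python
import numpy as np

def group_advantages(rewards):
    """GRPO advantage: each completion's reward minus the group mean."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

adv = group_advantages([1.0, 0.0, 1.0, 0.5])
# adv == [0.375, -0.625, 0.375, -0.125], matching the example
```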
Comparison to PPO and DPO:
vs. PPO: No value network, no clipping, simpler implementation. Group baseline replaces learned value function.
vs. DPO: Requires verifiable rewards (not just preferences). Can handle continuous reward signals, not limited to binary comparisons.
vs. RLHF: No separate reward model training phase. Reward must be computable (e.g., unit tests, rule checkers).
When to Use GRPO:
Verifiable rewards available: Code correctness, math proofs, fact-checking, rule compliance
Want simplicity: Easier than PPO (no value network), more flexible than DPO (handles continuous rewards)
Sample efficiency matters: Group baseline reduces variance without requiring large replay buffers
Online learning: Can update immediately after sampling, no offline preference collection needed
Implementation Details:
Typical group size: \(K = 4\) to \(16\) completions per prompt
Can add KL penalty to reference model: \(J = \mathbb{E}[\sum A_{\text{group}} \log \pi_\theta] - \beta \text{KL}(\pi_\theta \| \pi_{\text{ref}})\)
Often combined with rejection sampling: only update on prompts where at least one completion succeeds
Scales well with GPU parallelism (sample multiple completions in parallel)
17.10 Updated Comparison: TRPO → PPO → RLHF → DPO → GRPO
| Method | Constraint | Reward | Value Net? | Complexity |
|---|---|---|---|---|
| TRPO | Hard KL | Any | Optional | High (2nd order) |
| PPO | Soft (clipping) | Any | Yes | Medium |
| RLHF+PPO | KL penalty | Learned (BT) | Yes | Medium |
| DPO | Implicit (via \(\pi_{\text{ref}}\)) | Preferences | No | Low |
| GRPO | Optional KL | Verifiable | No | Low |
Summary:
TRPO/PPO: General RL with any reward function, requires value network
RLHF: Learns reward from preferences, full RL loop with PPO
DPO: Bypasses reward model and RL loop, direct preference optimization
GRPO: Bypasses value network, uses group baseline with verifiable rewards
18 Example: Invoice Extraction
18.1 Problem Setup
Goal: Fine-tune an open-source document understanding model (e.g., LayoutLMv3, Donut) to extract structured data from invoices using reinforcement learning.
Why RL for Invoice Extraction?
Reward easier to specify than exhaustive labels (business rules)
Can optimize complex objectives: accuracy + confidence + hallucination reduction
Handles sparse feedback: overall extraction quality vs. per-field annotations
18.2 MDP Formulation
Define the Markov Decision Process as:
State \(s\): Invoice image + OCR text + layout features
Action \(a\): Generate structured extraction (JSON with fields: vendor, date, total, line items)
Reward \(r(s,a)\): Composite score measuring extraction quality
Transition: Deterministic (single-step episode per invoice)
18.3 Reward Function Design
The reward function combines multiple components:
\[\begin{align} R(a, y^*) & = w_1 \cdot R_{\text{field}}(a, y^*) + w_2 \cdot R_{\text{struct}}(a) \notag \\ & \quad + w_3 \cdot R_{\text{format}}(a) - w_4 \cdot R_{\text{halluc}}(a) \end{align}\]
where:
\(R_{\text{field}}\): F1 score for each extracted field (vendor, date, amounts)
\(R_{\text{struct}}\): Structural validation (line items sum to subtotal, tax calculations)
\(R_{\text{format}}\): Format compliance (date formats, currency, decimals)
\(R_{\text{halluc}}\): Penalty for hallucinated or missing required fields
Example Reward Weights: \[R = 0.4 \cdot F1_{\text{fields}} + 0.3 \cdot \text{Validation}_{\text{struct}} + 0.2 \cdot \text{Compliance}_{\text{fmt}} - 0.1 \cdot \text{Penalty}_{\text{halluc}}\]
For an invoice with:
Field extraction F1 = 0.9 (correct vendor, date, total)
Structural validation = 1.0 (subtotal + tax = total)
Format compliance = 0.8 (minor date format issue)
No hallucinations = 0
Total reward: \(R = 0.4(0.9) + 0.3(1.0) + 0.2(0.8) - 0.1(0) = 0.82\)
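The composite reward is simple to compute; a sketch using the example weights (the function name and signature are our own, not from any extraction library):

```python
def invoice_reward(f1_fields, struct_valid, fmt_compliance, halluc_penalty,
                   w=(0.4, 0.3, 0.2, 0.1)):
    """Weighted composite reward; the default weights are the example's."""
    w1, w2, w3, w4 = w
    return (w1 * f1_fields + w2 * struct_valid
            + w3 * fmt_compliance - w4 * halluc_penalty)

reward = invoice_reward(0.9, 1.0, 0.8, 0.0)   # approx 0.82, as worked out above
```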
18.4 RL Algorithm: PPO
Proximal Policy Optimization (PPO) is well-suited for this task:
\[\begin{align} L^{\text{CLIP}}(\theta) & = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right] \\ r_t(\theta) & = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \\ \hat{A}_t & = R_t - V_\phi(s_t) \end{align}\]
where \(\epsilon = 0.2\) (clip range), \(V_\phi\) is value network, and \(\hat{A}_t\) is advantage estimate.
Understanding the Value Network \(V_\phi\):
\(V_\phi(s_t)\) is a critic trained online alongside the policy; it is not a pre-trained reward model.
Architecture: Shares transformer backbone with policy \(\pi_\theta\), but with a scalar output head (vs. token logits for policy)
Training objective: \(L_{\text{value}}(\phi) = \mathbb{E}[(V_\phi(s_t) - R_t)^2]\) – learns to predict expected return
Purpose: Provides baseline for advantage \(\hat{A}_t = R_t - V_\phi(s_t)\), reducing variance while keeping gradients unbiased
Key Distinction:
| Component | Role | When Trained |
|---|---|---|
| Reward Model \(R\) | Scores completions (human preferences) | Offline (before RL) |
| Value Network \(V_\phi\) | Estimates expected future reward | Online (during RL) |
| Policy \(\pi_\theta\) | Generates tokens | Online (during RL) |
The advantage \(\hat{A}_t\) tells us: “How much better/worse was this action than expected?” rather than using raw returns which have high variance.
Key PPO Parameters:
Learning rate: \(1 \times 10^{-5}\) (small for stability)
Batch size: 8 invoices per update
Epochs per batch: 3
Clip range \(\epsilon\): 0.2
Value coefficient: 0.5, Entropy coefficient: 0.01
18.5 Alternative: Direct Preference Optimization (DPO)
DPO simplifies training by using preference pairs without explicit reward model:
\[\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]\]
where \(y_w\) is preferred extraction, \(y_l\) is rejected extraction, \(\beta\) controls strength.
Preference Pair Construction:
For invoice with ground truth: {vendor: "Acme Corp", total: $500.00}
Preferred: {vendor: "Acme Corp", total: 500.00}
Rejected: {vendor: "Acme", total: 50.00} (partial name, OCR error)
DPO learns to prefer complete, accurate extractions over common errors.
18.6 Implementation Considerations
Model Selection: LayoutLMv3 (vision+text), Donut (end-to-end), Pix2Struct
Training Strategy: Start with supervised fine-tuning baseline, then apply RL
RL Library: TRL (Transformer Reinforcement Learning), OpenRLHF
Infrastructure: PyTorch + DeepSpeed for distributed training
Evaluation: Per-field F1, end-to-end success rate, business rule compliance
Challenges:
Reward hacking: Model exploits gaps in reward function
Sample efficiency: RL needs many iterations; use LoRA for faster experiments
Distribution shift: Test robustness on new invoice formats
19 Interview Questions
Q1: What’s the difference between the reward model and the value network in PPO?
A:
Reward Model \(R\): Trained offline on human preferences, scores completions (e.g., 0.8 for helpful, 0.3 for harmful)
Value Network \(V_\phi\): Trained online during RL, estimates expected future reward from state
Purpose: \(V_\phi\) provides baseline for advantage \(\hat{A}_t = R_t - V_\phi(s_t)\), reducing variance
Architecture: Both can share transformer backbone, different heads (scalar for value, logits for policy)
Q2: Why does PPO use the clipped objective instead of vanilla policy gradient?
A:
Problem: Large policy updates can cause performance collapse
Solution: Clip ratio \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) to \([1-\epsilon, 1+\epsilon]\) (typically \(\epsilon=0.2\))
Effect: Limits how much policy can change per update, ensuring stable training
vs TRPO: PPO simpler to implement (no conjugate gradient), similar performance
Q3: How does RLHF differ from supervised fine-tuning (SFT)?
A:
SFT: Learn from demonstrations via cross-entropy loss \(\mathcal{L} = -\log p(y|x)\)
RLHF: Optimize reward from human feedback via policy gradient
Advantage: RLHF can learn beyond demonstrations (exploration), optimize non-differentiable objectives (safety, helpfulness)
Standard pipeline: SFT first (warm start), then RLHF for alignment
Q4: What is DPO and why is it simpler than PPO?
A:
DPO (Direct Preference Optimization): Bypasses reward model, optimizes directly on preference pairs
Loss: \(\mathcal{L} = -\mathbb{E}[\log \sigma(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)})]\)
Benefit: No reward model training, no value network, simpler pipeline
Trade-off: Less flexible (requires pairwise data), may underperform PPO on complex tasks
Q5: How do you prevent reward hacking in RLHF?
A:
KL penalty: Add \(-\beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})\) to the reward to keep the policy close to the reference
Diverse reward models: Train ensemble and use min/average to avoid exploiting single model
Iterative refinement: Collect more human feedback on policy outputs, retrain reward model
Rule-based constraints: Hard penalties for undesired behaviors (toxicity, hallucination)
Q6: Why use LoRA for RLHF instead of full fine-tuning?
A:
Memory: RLHF with PPO keeps multiple models resident (policy, frozen reference, and typically value and reward models); LoRA reduces the trainable footprint 10-100×
Sample efficiency: LoRA converges faster, enabling more experiments per GPU-hour
Multi-task: Can train separate LoRA adapters per task without reloading base model
Stability: Freezing base weights reduces catastrophic forgetting risk
Q7: What’s the three-step RLHF pipeline for ChatGPT/Claude?
A:
SFT: Supervised fine-tune on high-quality demonstrations (instruction-following)
Reward Modeling: Train reward model on human preference pairs (A vs B comparisons)
PPO: Optimize policy to maximize reward while staying close to SFT model (KL penalty)
20 Chapter Summary
Core RL Concepts: Reinforcement learning frames LLM training as a Markov Decision Process where states are prompts/contexts, actions are token selections, and rewards come from human feedback. The policy \(\pi_\theta\) (the LLM itself) learns to maximize expected cumulative reward \(\mathbb{E}[\sum \gamma^t r_t]\) through iterative policy updates.
RLHF Pipeline: The standard three-step process starts with supervised fine-tuning on demonstrations, trains a reward model on human preference pairs using Bradley-Terry comparison, then optimizes the policy via PPO with KL-divergence regularization to prevent drift from the reference model. This produces aligned models (ChatGPT, Claude) that follow instructions while avoiding harmful outputs.
PPO Mechanics: Proximal Policy Optimization uses a clipped objective \(\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t, 1\pm\epsilon)\hat{A}_t)\) to limit policy updates, preventing catastrophic performance collapse. The value network \(V_\phi\) trains online to predict expected returns, providing a baseline for advantage estimation \(\hat{A}_t = R_t - V_\phi(s_t)\) that reduces gradient variance.
DPO Alternative: Direct Preference Optimization bypasses the reward model entirely, optimizing directly on preference pairs via a reparameterization trick. While simpler (no reward model, no value network), DPO requires pairwise comparison data and may underperform PPO on complex tasks requiring fine-grained reward shaping.
Practical Considerations: Production RLHF faces reward hacking (model exploits reward function gaps), sample inefficiency (requires many rollouts), and distribution shift (test behavior differs from training). Solutions include KL penalties to constrain exploration, ensemble reward models to prevent exploitation, LoRA for memory efficiency, and iterative human-in-the-loop refinement.
Key Takeaways:
RLHF enables optimizing non-differentiable objectives (safety, helpfulness) beyond supervised learning
Reward model quality determines the alignment ceiling; invest in diverse, high-quality preference data
KL penalty \(\beta\) balances reward optimization vs stability: too high keeps the policy pinned near the reference, too low invites reward hacking and mode collapse
Start with SFT for warm start, use LoRA for sample efficiency, monitor reward hacking continuously
DPO works well for simple alignment (harmlessness), PPO better for complex objectives (reasoning + safety)