Chapter 13: Reinforcement Learning

15 RL Foundations

15.1 Core Concepts

An MDP is defined by the tuple \((S, A, P, r, \gamma)\):

  • \(S\): State space – environment observations (LLMs: prompt/context)

  • \(A\): Action space – possible actions (LLMs: tokens/completions)

  • \(P(s'|s,a)\): Transition dynamics – next state distribution

  • \(r(s,a)\): Reward function – immediate reward for taking action \(a\) in state \(s\)

  • \(\gamma \in [0,1]\): Discount factor – weight for future rewards

A policy \(\pi_\theta(a|s)\) is a conditional probability distribution over actions given states, parameterized by \(\theta\).

  • Classical RL: typically a neural network mapping states to action probabilities

  • LLMs: the entire language model itself

A trajectory \(\tau\) is a sequence of states and actions: \[\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)\] In LLMs, a trajectory is a generated completion.

15.2 Value Functions

The value function \(V^\pi(s)\) is the expected return starting from state \(s\) under policy \(\pi\): \[V^\pi(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s\right]\]

The Q-function \(Q^\pi(s,a)\) is the expected return starting from state \(s\), taking action \(a\), then following \(\pi\): \[Q^\pi(s,a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0 = s, a_0 = a\right]\]

The advantage function measures how much better an action is compared to the policy’s average: \[A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)\]

Note

Intuition:

  • \(V^\pi(s)\): "How good is this state?"

  • \(Q^\pi(s,a)\): "How good is taking action \(a\) in state \(s\)?"

  • \(A^\pi(s,a)\): "How much better is action \(a\) than the average action?"

If \(A^\pi(s,a) > 0\), action \(a\) is better than average; if \(A^\pi(s,a) < 0\), it’s worse.
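These three quantities are easy to check numerically. A minimal sketch (the Q-values and policy below are toy numbers chosen only for illustration): given a Q-table and a policy, \(V(s)\) is the policy-weighted average of \(Q(s,a)\), and the advantage is the residual.

```python
def value_from_q(q_values, policy):
    """V(s) = E_{a ~ pi}[Q(s, a)]: the policy-weighted average of Q-values."""
    return sum(p * q for p, q in zip(policy, q_values))

def advantages(q_values, policy):
    """A(s, a) = Q(s, a) - V(s) for every action a."""
    v = value_from_q(q_values, policy)
    return [q - v for q in q_values]

q = [1.0, 2.0, 4.0]        # Q(s, a) for three actions (toy numbers)
pi = [0.5, 0.25, 0.25]     # pi(a | s)

v = value_from_q(q, pi)     # 0.5*1 + 0.25*2 + 0.25*4 = 2.0
adv = advantages(q, pi)     # [-1.0, 0.0, 2.0]: only action 2 beats the average
```

Note that the policy-weighted average of the advantages is zero by construction: "better than average" and "worse than average" must balance out.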

15.3 The Discount Factor \(\gamma\)

Why do we discount future rewards?

  1. Convergence: For infinite-horizon problems, \(\sum_{t=0}^\infty r_t\) may diverge. Using \(\gamma < 1\) ensures \(\sum_{t=0}^\infty \gamma^t r_t\) converges.

  2. Credit assignment: Encourages immediate rewards over distant ones – helps learning by focusing on near-term consequences.

  3. Mathematical convenience: Makes the Bellman operator a contraction mapping, guaranteeing convergence of iterative algorithms.

  4. LLM RLHF: Episodes are short (single completion), so often \(\gamma = 1\) (no discounting needed).

15.4 Objective: Expected Return

The RL objective is to maximize expected return: \[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \gamma^t r_t\right] = \mathbb{E}_{s \sim d^\pi, a \sim \pi_\theta}[Q^\pi(s,a)]\]

where \(d^\pi\) is the state distribution induced by policy \(\pi\).

16 Deep Q-Networks (DQN)

16.1 From Q-Learning to DQN

Classical Q-learning maintains a table \(Q(s,a)\) for discrete state-action spaces, updated via: \[Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]\]
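A minimal sketch of this tabular update (the states, actions, rewards, and hyperparameters are made up for illustration):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One TD step: Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)              # unseen (s, a) pairs default to 0
actions = ["left", "right"]
Q[("s1", "right")] = 1.0            # a pre-existing estimate at the next state

q_new = q_update(Q, "s0", "right", r=0.0, s_next="s1", actions=actions)
# target = 0 + 0.9 * 1.0 = 0.9; Q("s0", "right") moves from 0 toward it: 0.45
```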

This fails for large/continuous state spaces (e.g., Atari pixels). DQN (Mnih et al., 2015) approximates \(Q(s,a)\) with a neural network \(Q_\theta(s,a)\), enabling RL in high-dimensional environments.

16.2 Core Innovations

1. Experience Replay: Store transitions \((s, a, r, s')\) in replay buffer \(\mathcal{D}\). Sample random mini-batches to train, breaking temporal correlations.

Why this matters: Sequential samples are highly correlated (same trajectory), causing unstable training. Random sampling makes data i.i.d.-like, stabilizing gradient updates.

2. Target Network: Maintain separate target network \(Q_{\theta^-}\) with frozen weights, updated periodically from \(Q_\theta\).

Loss function: \[\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q_{\theta^-}(s',a') - Q_\theta(s,a)\right)^2\right]\]

Why this matters: Without a target network, both sides of the Bellman update change simultaneously (chasing a moving target), causing oscillations. Freezing \(\theta^-\) for \(C\) steps (e.g., \(C=10{,}000\)) stabilizes training.

16.3 DQN Algorithm

Initialize replay buffer \(\mathcal{D}\) with capacity \(N\); initialize Q-network \(Q_\theta\) with random weights \(\theta\); initialize target network \(Q_{\theta^-}\) with \(\theta^- = \theta\); observe initial state \(s_0\). Then, for each step \(t\):

  1. Select action \(a_t = \begin{cases} \text{random action} & \text{w.p. } \epsilon \\ \arg\max_a Q_\theta(s_t, a) & \text{otherwise} \end{cases}\)

  2. Execute \(a_t\), observe reward \(r_t\) and next state \(s_{t+1}\)

  3. Store transition \((s_t, a_t, r_t, s_{t+1})\) in \(\mathcal{D}\)

  4. Sample a random mini-batch of transitions from \(\mathcal{D}\)

  5. Compute target: \(y_i = r_i + \gamma \max_{a'} Q_{\theta^-}(s_{i+1}, a')\)

  6. Perform gradient descent on \(\mathcal{L}(\theta) = (y_i - Q_\theta(s_i, a_i))^2\)

  7. Every \(C\) steps: \(\theta^- \leftarrow \theta\)
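The data-handling bookkeeping in this loop (the replay buffer and \(\epsilon\)-greedy selection) can be sketched in a few lines. The networks themselves are stubbed out as plain lists of Q-values; all names and values are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are evicted first."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)

    def push(self, transition):
        self.buf.append(transition)          # transition = (s, a, r, s_next)

    def sample(self, batch_size):
        return random.sample(list(self.buf), batch_size)

    def __len__(self):
        return len(self.buf)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, else argmax_a Q(s, a)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.push((t, 0, 0.0, t + 1))     # capacity 3: only the newest 3 survive

batch = buf.sample(2)                # random mini-batch, breaking temporal order
greedy = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)   # pure exploitation
```

Random sampling from the buffer is what decorrelates consecutive updates; the `maxlen` eviction keeps memory bounded.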

16.4 Extensions and Improvements

Double DQN (DDQN): Addresses overestimation bias in \(\max_{a'} Q(s',a')\).

Standard DQN uses same network to select and evaluate action: \[y = r + \gamma Q_{\theta^-}(s', \arg\max_{a'} Q_{\theta^-}(s',a'))\]

DDQN decouples selection (online network) from evaluation (target network): \[y = r + \gamma Q_{\theta^-}(s', \arg\max_{a'} Q_\theta(s',a'))\]

This reduces the overestimation (positive) bias introduced by the \(\max\) operator.

Dueling DQN: Decomposes \(Q(s,a) = V(s) + A(s,a)\) where \(V(s)\) is state value and \(A(s,a)\) is advantage.

Architecture: Shared convolutional encoder → split into two streams → combine via \(Q(s,a) = V(s) + A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')\) (centering).

Why this helps: For many states, action choice doesn’t matter much. Learning \(V(s)\) separately improves sample efficiency.
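The centering step itself is a one-liner; a sketch with made-up value and advantage numbers:

```python
def dueling_q(v, adv):
    """Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    mean_adv = sum(adv) / len(adv)
    return [v + a - mean_adv for a in adv]

q = dueling_q(v=2.0, adv=[1.0, -1.0, 0.0])   # mean advantage is 0 -> [3.0, 1.0, 2.0]
```

Subtracting the mean advantage makes the \(V\)/\(A\) decomposition identifiable: without it, adding a constant to \(V\) and subtracting it from every \(A(s,a)\) would leave \(Q\) unchanged.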

Prioritized Experience Replay: Sample transitions with probability proportional to TD error \(|\delta| = |r + \gamma \max_{a'} Q(s',a') - Q(s,a)|\).

High-error transitions provide more learning signal. Corrects bias via importance sampling weights.

16.5 DQN vs Policy Gradient Methods

| Aspect | DQN (Value-Based) | PPO (Policy-Based) |
|---|---|---|
| Action space | Discrete only | Continuous + discrete |
| Sample efficiency | Higher (off-policy) | Lower (on-policy) |
| Stability | Requires tricks (replay, target net) | More stable (clipping) |
| Exploration | \(\epsilon\)-greedy | Stochastic policy |
| LLM applicability | Poor (discrete tokens, but huge action space) | Excellent |

Why DQN rarely used for LLMs: Vocabulary size is 30K-100K tokens → \(Q(s,a)\) has 100K outputs per state. Softmax over Q-values works, but policy gradient methods (PPO) learn distributions directly, handling large action spaces more naturally.

Note

Historical Impact: DQN achieved human-level Atari game performance from raw pixels (2015), launching the deep RL revolution. While less relevant for LLMs today, its innovations (experience replay, target networks) influenced later algorithms and remain foundational for understanding modern RL.

17 Policy Optimization: From TRPO to DPO

17.1 Foundation: Policy Gradient Theorem

The policy gradient theorem provides the foundation for all policy optimization methods.

The gradient of the expected return with respect to policy parameters \(\theta\) is: \[\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^\pi(s,a)\right]\] where \(Q^\pi(s,a)\) is the action-value function under policy \(\pi\).

Note

Key Insight: We can improve the policy by moving in the direction that increases the log-probability of actions with high Q-values (good outcomes) and decreases the log-probability of actions with low Q-values (bad outcomes).

17.2 Why \(\nabla_\theta \log \pi_\theta\)?

This is the score function or log-derivative trick:

Derivation: \[\begin{align*} \nabla_\theta J(\theta) & = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] \\ & = \nabla_\theta \int \pi_\theta(\tau) R(\tau) \, d\tau \\ & = \int \nabla_\theta \pi_\theta(\tau) R(\tau) \, d\tau \\ & = \int \pi_\theta(\tau) \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} R(\tau) \, d\tau \\ & = \int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) R(\tau) \, d\tau \\ & = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(\tau) \cdot R(\tau)\right] \end{align*}\]

The key trick: \(\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta\) (from chain rule on logs).

Why this matters:

  • We can estimate gradients by sampling trajectories – no need to differentiate through the environment dynamics!

  • The \(\log \pi_\theta\) term makes gradients tractable for neural networks

  • This is the foundation of REINFORCE, TRPO, PPO, and ultimately DPO

Example

Concrete Example: Why \(\log\)?

Suppose \(\pi_\theta(a|s) = \text{softmax}(f_\theta(s))_a = \frac{e^{f_\theta(s)_a}}{\sum_j e^{f_\theta(s)_j}}\).

Then: \[\log \pi_\theta(a|s) = f_\theta(s)_a - \log \sum_j e^{f_\theta(s)_j}\]

Taking gradient: \[\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta f_\theta(s)_a - \mathbb{E}_{a' \sim \pi_\theta}[\nabla_\theta f_\theta(s)_{a'}]\]

This is the difference between gradients for the chosen action and the expected gradient – exactly the "reinforce good actions, suppress bad actions" intuition!
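This identity is easy to verify numerically. The sketch below computes the gradient of \(\log \text{softmax}(z)_a\) with respect to the logits \(z\) (which chain-rules into the \(\theta\)-gradient above) and checks it against a finite difference; the logits are arbitrary:

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def grad_log_softmax(logits, a):
    """d/dz_j log softmax(z)_a = 1[j == a] - softmax(z)_j."""
    p = softmax(logits)
    return [(1.0 if j == a else 0.0) - p[j] for j in range(len(logits))]

logits = [2.0, 0.5, -1.0]    # arbitrary logits for the check
g = grad_log_softmax(logits, a=0)

# Finite-difference check on the first coordinate:
eps = 1e-6
def f(z):
    return math.log(softmax(z)[0])
bumped = list(logits)
bumped[0] += eps
numeric = (f(bumped) - f(logits)) / eps   # should closely match g[0]
```

The gradient components also sum to zero (one-hot minus a probability vector), which is exactly the "push up the chosen action, push down the rest" behavior described above.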

17.3 Step 1: Vanilla Policy Gradient (REINFORCE)

The simplest approach uses the policy gradient directly: \[\theta_{t+1} = \theta_t + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t\]

Problems:

  • High variance in gradient estimates

  • Destructive updates: Large steps can catastrophically degrade performance

  • No constraint on how much policy changes between updates
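A toy REINFORCE run makes the update concrete: a two-armed bandit with a softmax policy, where arm 1 always pays reward 1 and arm 0 pays nothing (the rewards and step size are assumptions for illustration). Despite the sampling noise, the policy concentrates on the better arm:

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def reinforce_step(logits, rewards, alpha=0.5, rng=random):
    """Sample an action, then nudge logits by alpha * grad log pi(a) * R."""
    p = softmax(logits)
    a = rng.choices(range(len(p)), weights=p)[0]
    r = rewards[a]
    # grad_z log softmax(z)_a = onehot(a) - p
    grad = [(1.0 if j == a else 0.0) - p[j] for j in range(len(p))]
    return [z + alpha * g * r for z, g in zip(logits, grad)]

random.seed(0)
logits = [0.0, 0.0]
for _ in range(500):
    logits = reinforce_step(logits, rewards=[0.0, 1.0])   # arm 1 always pays 1

probs = softmax(logits)   # the policy mass concentrates on the rewarding arm
```

Even in this trivial setting, the per-step gradients are noisy (zero-reward pulls contribute nothing); the high variance and unbounded step sizes are exactly the problems TRPO and PPO address next.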

17.4 Step 2: Trust Region Policy Optimization (TRPO)

TRPO addresses destructive updates by constraining policy changes using KL divergence.

\[\max_\theta \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\text{old}}}(s,a)\right]\] subject to: \[\mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}\left[\text{KL}\left(\pi_{\theta_{\text{old}}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\right)\right] \leq \delta\]

Key Components:

  • Importance sampling ratio: \(r_t(\theta) = \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}\) reweights old trajectories

  • Advantage function: \(A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)\) measures how much better action \(a\) is than average

  • KL constraint: Ensures new policy \(\pi_\theta\) stays close to old policy \(\pi_{\theta_{\text{old}}}\)

Note

Why KL Divergence? KL measures how much the distribution over actions changes. Keeping it small prevents catastrophic policy collapse. The constraint uses the Fisher information metric from the policy’s probability distribution.

Limitations:

  • Computationally expensive (requires solving constrained optimization with conjugate gradient)

  • Requires second-order derivatives (Hessian-vector products)

  • Difficult to implement correctly

17.5 Step 3: Proximal Policy Optimization (PPO)

PPO simplifies TRPO by replacing the hard KL constraint with a clipped objective.

\[L^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) A_t, \, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]\] where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) and \(\epsilon \approx 0.2\).

How Clipping Works:

  • If \(A_t > 0\) (good action): clip \(r_t\) to \([1, 1+\epsilon]\) – limit how much we increase its probability

  • If \(A_t < 0\) (bad action): clip \(r_t\) to \([1-\epsilon, 1]\) – limit how much we decrease its probability

  • Taking \(\min\) chooses the more conservative (pessimistic) objective

Example

PPO Clipping Intuition:

Suppose \(\epsilon = 0.2\) and we have an action with \(A_t = +5\) (very good).

  • Unclipped: \(r_t \cdot 5\) could grow arbitrarily large if \(r_t\) is large

  • Clipped: \(\min(r_t \cdot 5, \, 1.2 \cdot 5) = 6\) once \(r_t > 1.2\)

  • Result: PPO stops increasing action probability once it’s 20% more likely than before

This prevents overshooting: even if an action looks great, we don’t want to make it too dominant.
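The clipped term is simple enough to compute by hand; a sketch reproducing the numbers above (\(\epsilon = 0.2\)):

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single action."""
    unclipped = ratio * advantage
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * advantage
    return min(unclipped, clipped)

capped = ppo_clip_term(1.5, 5.0)    # ratio past 1 + eps: capped at 1.2 * 5 = 6.0
inside = ppo_clip_term(1.1, 5.0)    # inside the band: unclipped 1.1 * 5 = 5.5
floor = ppo_clip_term(0.5, -5.0)    # A < 0, ratio below 1 - eps: min picks 0.8 * -5 = -4.0
```

In the last case the \(\min\) selects the clipped branch, which is constant in the ratio: the objective value is pessimistic (\(-4.0\)) but contributes no gradient, so there is no incentive to push the action's probability down any further.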

Note

PPO vs. TRPO:

  • TRPO: Hard KL constraint, requires second-order optimization

  • PPO: Soft constraint via clipping, first-order optimization only

  • PPO is simpler, faster, and often performs comparably to TRPO

17.6 Step 4: RLHF with Reward Models

For LLM alignment, we introduce Reinforcement Learning from Human Feedback (RLHF):

  1. Collect preferences: Humans compare LLM outputs: "Response A \(\succ\) Response B"

  2. Train reward model: Use Bradley-Terry model to fit \(r_\theta(x, y)\): \[P(y_1 \succ y_2 \mid x) = \sigma(r_\theta(x, y_1) - r_\theta(x, y_2))\]

  3. Optimize policy: Use PPO to maximize predicted reward: \[\max_\theta \mathbb{E}_{x,y \sim \pi_\theta}\left[r_\theta(x, y)\right] - \beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})\]

The KL penalty \(\beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}})\) prevents the policy from drifting too far from the reference model (avoiding reward hacking and maintaining language quality).
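The reward-model training loss on a single pair is just logistic regression on the score difference; a sketch with placeholder reward values (a real model would produce \(r_\theta(x, y)\) from the prompt and completion):

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """-log P(chosen > rejected) = -log sigma(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    return math.log(1.0 + math.exp(-margin))   # equals -log sigmoid(margin)

loss_good = bradley_terry_loss(r_chosen=2.0, r_rejected=-1.0)   # margin +3: small loss
loss_bad = bradley_terry_loss(r_chosen=-1.0, r_rejected=2.0)    # margin -3: large loss
```

Only the score *difference* matters, so the reward model is identified up to an additive constant per prompt, a fact DPO exploits below.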

Note

Bradley-Terry Connection: The reward model is trained as a pairwise comparator using logistic regression on preference data. See the Logistic Regression notes for detailed derivation.

17.7 Step 5: Direct Preference Optimization (DPO)

DPO eliminates the reward model entirely by directly optimizing preferences!

The optimal policy under the RLHF objective has a closed form: \[\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r^*(x,y)\right)\]

Rearranging gives the implicit reward: \[r^*(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)\]

Key Idea: Instead of training a reward model \(r_\theta\) and then optimizing it with RL, we can directly parameterize the policy and optimize the Bradley-Terry loss!

Given preference data \((x, y_w, y_l)\) where \(y_w \succ y_l\): \[\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]\]

Derivation Sketch: \[\begin{align*} P(y_w \succ y_l \mid x) & = \sigma(r(x,y_w) - r(x,y_l)) \quad \text{(Bradley-Terry)} \\ & = \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right) \quad \text{(substitute implicit reward)} \end{align*}\]

We maximize this likelihood directly–no separate reward model or RL training loop needed!
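On a single preference pair the loss needs only four sequence-level log-probabilities; a sketch with placeholder numbers (\(\beta = 0.1\) is a commonly used value):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigma(beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log(1.0 + math.exp(-margin))   # equals -log sigmoid(margin)

# Policy raised y_w and lowered y_l relative to the reference: low loss.
low = dpo_loss(logp_w=-10.0, logp_l=-30.0, ref_logp_w=-15.0, ref_logp_l=-20.0)
# Policy moved the wrong way on both completions: high loss.
high = dpo_loss(logp_w=-20.0, logp_l=-10.0, ref_logp_w=-15.0, ref_logp_l=-20.0)
```

Because the loss depends only on log-ratios against the frozen reference, the intractable \(\log Z(x)\) term cancels out, which is what makes direct optimization possible.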

Note

DPO Advantages:

  • Simpler: Single-stage training (no reward model, no PPO)

  • Stable: No reward hacking or value network training issues

  • Efficient: Standard supervised learning, easier to implement

  • Interpretable: Direct connection to Bradley-Terry preferences

17.8 Comparison: TRPO → PPO → RLHF → DPO

| Method | Constraint | Complexity | Reward Model? | RL Loop? |
|---|---|---|---|---|
| TRPO | Hard KL | High (2nd order) | Optional | Yes |
| PPO | Soft (clipping) | Medium (1st order) | Optional | Yes |
| RLHF+PPO | KL penalty | Medium | Yes | Yes |
| DPO | Implicit (via \(\pi_{\text{ref}}\)) | Low | No | No |

When to Use Each:

  • TRPO: Strong theoretical guarantees needed, computational cost acceptable

  • PPO: General-purpose RL, good balance of performance and simplicity

  • RLHF+PPO: Complex reward functions, need iterative refinement

  • DPO: High-quality preference data available, want simplicity and stability

17.9 Step 6: Group Relative Policy Optimization (GRPO / RLVR)

GRPO is a recent approach, closely associated with RLVR (RL with Verifiable Rewards), that combines strengths of PPO and DPO for LLM alignment.

Given a prompt \(x\):

  1. Sample multiple completions: \(\{y_1, y_2, \ldots, y_K\} \sim \pi_\theta(\cdot|x)\)

  2. Evaluate each with verifiable reward function: \(r(x, y_i)\)

  3. Use group relative advantages to update policy

Key Idea: Instead of training a separate reward model (RLHF) or using only pairwise preferences (DPO), GRPO:

  • Uses a verifiable reward (e.g., code correctness, math verification, rule compliance)

  • Computes advantages relative to the group of sampled completions

  • Applies vanilla policy gradient with group baseline

For each prompt \(x\), sample \(K\) completions and compute: \[\nabla_\theta J(\theta) = \mathbb{E}_{x, \{y_i\}_{i=1}^K \sim \pi_\theta}\left[\sum_{i=1}^K \nabla_\theta \log \pi_\theta(y_i|x) \cdot A_{\text{group}}(x, y_i)\right]\] where the group advantage is: \[A_{\text{group}}(x, y_i) = r(x, y_i) - \frac{1}{K}\sum_{j=1}^K r(x, y_j)\]

Intuition:

  • The baseline \(\frac{1}{K}\sum_{j=1}^K r(x, y_j)\) is the average reward within the sampled group

  • If \(y_i\) is better than average in the group, \(A_{\text{group}}(x, y_i) > 0\) → increase its probability

  • If \(y_i\) is worse than average, \(A_{\text{group}}(x, y_i) < 0\) → decrease its probability

  • No value network needed–the group itself provides the baseline!

Example

GRPO Example: Code Generation

For prompt \(x\) = "Write Python function to sort a list":

  • Sample \(K=4\) completions from \(\pi_\theta\)

  • Run unit tests: \(r(x, y_1) = 1.0\) (passes), \(r(x, y_2) = 0.0\) (fails), \(r(x, y_3) = 1.0\), \(r(x, y_4) = 0.5\) (partial)

  • Group baseline: \(\bar{r} = \frac{1.0 + 0.0 + 1.0 + 0.5}{4} = 0.625\)

  • Advantages: \(A_1 = +0.375\), \(A_2 = -0.625\), \(A_3 = +0.375\), \(A_4 = -0.125\)

Update increases probability of \(y_1\) and \(y_3\) (correct solutions), decreases \(y_2\) and \(y_4\).
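The advantage computation from this example can be reproduced in two lines:

```python
def group_advantages(rewards):
    """Subtract the group-mean baseline from each sampled completion's reward."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

rewards = [1.0, 0.0, 1.0, 0.5]     # unit-test scores for the K = 4 samples
adv = group_advantages(rewards)     # baseline 0.625 -> [0.375, -0.625, 0.375, -0.125]
```

The advantages always sum to zero within a group, so the update redistributes probability mass among the sampled completions rather than inflating everything.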

Comparison to PPO and DPO:

  • vs. PPO: No value network, no clipping, simpler implementation. Group baseline replaces learned value function.

  • vs. DPO: Requires verifiable rewards (not just preferences). Can handle continuous reward signals, not limited to binary comparisons.

  • vs. RLHF: No separate reward model training phase. Reward must be computable (e.g., unit tests, rule checkers).

Note

When to Use GRPO:

  • Verifiable rewards available: Code correctness, math proofs, fact-checking, rule compliance

  • Want simplicity: Easier than PPO (no value network), more flexible than DPO (handles continuous rewards)

  • Sample efficiency matters: Group baseline reduces variance without requiring large replay buffers

  • Online learning: Can update immediately after sampling, no offline preference collection needed

Implementation Details:

  • Typical group size: \(K = 4\) to \(16\) completions per prompt

  • Can add KL penalty to reference model: \(J = \mathbb{E}[\sum A_{\text{group}} \log \pi_\theta] - \beta \text{KL}(\pi_\theta \| \pi_{\text{ref}})\)

  • Often combined with rejection sampling: only update on prompts where at least one completion succeeds

  • Scales well with GPU parallelism (sample multiple completions in parallel)

17.10 Updated Comparison: TRPO → PPO → RLHF → DPO → GRPO

| Method | Constraint | Reward | Value Net? | Complexity |
|---|---|---|---|---|
| TRPO | Hard KL | Any | Optional | High (2nd order) |
| PPO | Soft (clipping) | Any | Yes | Medium |
| RLHF+PPO | KL penalty | Learned (BT) | Yes | Medium |
| DPO | Implicit (via \(\pi_{\text{ref}}\)) | Preferences | No | Low |
| GRPO | Optional KL | Verifiable | No | Low |

Summary:

  • TRPO/PPO: General RL with any reward function, requires value network

  • RLHF: Learns reward from preferences, full RL loop with PPO

  • DPO: Bypasses reward model and RL loop, direct preference optimization

  • GRPO: Bypasses value network, uses group baseline with verifiable rewards

18 Example: Invoice Extraction

18.1 Problem Setup

Goal: Fine-tune an open-source document understanding model (e.g., LayoutLMv3, Donut) to extract structured data from invoices using reinforcement learning.

Note

Why RL for Invoice Extraction?

  • Reward easier to specify than exhaustive labels (business rules)

  • Can optimize complex objectives: accuracy + confidence + hallucination reduction

  • Handles sparse feedback: overall extraction quality vs. per-field annotations

18.2 MDP Formulation

Define the Markov Decision Process as:

  • State \(s\): Invoice image + OCR text + layout features

  • Action \(a\): Generate structured extraction (JSON with fields: vendor, date, total, line items)

  • Reward \(r(s,a)\): Composite score measuring extraction quality

  • Transition: Deterministic (single-step episode per invoice)

18.3 Reward Function Design

The reward function combines multiple components:

\[\begin{align} R(a, y^*) & = w_1 \cdot R_{\text{field}}(a, y^*) + w_2 \cdot R_{\text{struct}}(a) \notag \\ & \quad + w_3 \cdot R_{\text{format}}(a) - w_4 \cdot R_{\text{halluc}}(a) \end{align}\]

where:

  • \(R_{\text{field}}\): F1 score for each extracted field (vendor, date, amounts)

  • \(R_{\text{struct}}\): Structural validation (line items sum to subtotal, tax calculations)

  • \(R_{\text{format}}\): Format compliance (date formats, currency, decimals)

  • \(R_{\text{halluc}}\): Penalty for hallucinated or missing required fields

Example

Example Reward Weights: \[R = 0.4 \cdot F1_{\text{fields}} + 0.3 \cdot \text{Validation}_{\text{struct}} + 0.2 \cdot \text{Compliance}_{\text{fmt}} - 0.1 \cdot \text{Penalty}_{\text{halluc}}\]

For an invoice with:

  • Field extraction F1 = 0.9 (correct vendor, date, total)

  • Structural validation = 1.0 (subtotal + tax = total)

  • Format compliance = 0.8 (minor date format issue)

  • No hallucinations = 0

Total reward: \(R = 0.4(0.9) + 0.3(1.0) + 0.2(0.8) - 0.1(0) = 0.82\)
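The same arithmetic as a small helper (weights taken from the example above):

```python
def invoice_reward(f1, struct, fmt, halluc, w=(0.4, 0.3, 0.2, 0.1)):
    """R = w1*F1_fields + w2*Validation_struct + w3*Compliance_fmt - w4*Penalty_halluc."""
    return w[0] * f1 + w[1] * struct + w[2] * fmt - w[3] * halluc

r = invoice_reward(f1=0.9, struct=1.0, fmt=0.8, halluc=0.0)
# 0.4*0.9 + 0.3*1.0 + 0.2*0.8 - 0.1*0 = 0.82
```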

18.4 RL Algorithm: PPO

Proximal Policy Optimization (PPO) is well-suited for this task:

\[\begin{align} L^{\text{CLIP}}(\theta) & = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right] \\ r_t(\theta) & = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \\ \hat{A}_t & = R_t - V_\phi(s_t) \end{align}\]

where \(\epsilon = 0.2\) (clip range), \(V_\phi\) is value network, and \(\hat{A}_t\) is advantage estimate.

Note

Understanding the Value Network \(V_\phi\):

\(V_\phi(s_t)\) is a critic trained online alongside the policy–not a pre-trained reward model.

  • Architecture: Shares transformer backbone with policy \(\pi_\theta\), but with a scalar output head (vs. token logits for policy)

  • Training objective: \(L_{\text{value}}(\phi) = \mathbb{E}[(V_\phi(s_t) - R_t)^2]\) – learns to predict expected return

  • Purpose: Provides baseline for advantage \(\hat{A}_t = R_t - V_\phi(s_t)\), reducing variance while keeping gradients unbiased

Key Distinction:

| Component | Role | When Trained |
|---|---|---|
| Reward Model \(R\) | Scores completions (human preferences) | Offline (before RL) |
| Value Network \(V_\phi\) | Estimates expected future reward | Online (during RL) |
| Policy \(\pi_\theta\) | Generates tokens | Online (during RL) |

The advantage \(\hat{A}_t\) tells us: “How much better/worse was this action than expected?” rather than using raw returns which have high variance.

Key PPO Parameters:

  • Learning rate: \(1 \times 10^{-5}\) (small for stability)

  • Batch size: 8 invoices per update

  • Epochs per batch: 3

  • Clip range \(\epsilon\): 0.2

  • Value coefficient: 0.5, Entropy coefficient: 0.01

18.5 Alternative: Direct Preference Optimization (DPO)

DPO simplifies training by using preference pairs without explicit reward model:

\[\begin{align} \mathcal{L}_{\text{DPO}}(\theta) & = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right] \end{align}\]

where \(y_w\) is preferred extraction, \(y_l\) is rejected extraction, \(\beta\) controls strength.

Example

Preference Pair Construction:

For invoice with ground truth: {vendor: "Acme Corp", total: $500.00}

Preferred: {vendor: "Acme Corp", total: 500.00}
Rejected: {vendor: "Acme", total: 50.00} (partial name, OCR error)

DPO learns to prefer complete, accurate extractions over common errors.

18.6 Implementation Considerations

  • Model Selection: LayoutLMv3 (vision+text), Donut (end-to-end), Pix2Struct

  • Training Strategy: Start with supervised fine-tuning baseline, then apply RL

  • RL Library: TRL (Transformer Reinforcement Learning), OpenRLHF

  • Infrastructure: PyTorch + DeepSpeed for distributed training

  • Evaluation: Per-field F1, end-to-end success rate, business rule compliance

Note

Challenges:

  • Reward hacking: Model exploits gaps in reward function

  • Sample efficiency: RL needs many iterations; use LoRA for faster experiments

  • Distribution shift: Test robustness on new invoice formats

19 Interview Questions

Note

Q1: What’s the difference between the reward model and the value network in PPO?

A:

  • Reward Model \(R\): Trained offline on human preferences, scores completions (e.g., 0.8 for helpful, 0.3 for harmful)

  • Value Network \(V_\phi\): Trained online during RL, estimates expected future reward from state

  • Purpose: \(V_\phi\) provides baseline for advantage \(\hat{A}_t = R_t - V_\phi(s_t)\), reducing variance

  • Architecture: Both can share transformer backbone, different heads (scalar for value, logits for policy)

Q2: Why does PPO use the clipped objective instead of vanilla policy gradient?

A:

  • Problem: Large policy updates can cause performance collapse

  • Solution: Clip ratio \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) to \([1-\epsilon, 1+\epsilon]\) (typically \(\epsilon=0.2\))

  • Effect: Limits how much policy can change per update, ensuring stable training

  • vs TRPO: PPO simpler to implement (no conjugate gradient), similar performance

Q3: How does RLHF differ from supervised fine-tuning (SFT)?

A:

  • SFT: Learn from demonstrations via cross-entropy loss \(\mathcal{L} = -\log p(y|x)\)

  • RLHF: Optimize reward from human feedback via policy gradient

  • Advantage: RLHF can learn beyond demonstrations (exploration), optimize non-differentiable objectives (safety, helpfulness)

  • Standard pipeline: SFT first (warm start), then RLHF for alignment

Q4: What is DPO and why is it simpler than PPO?

A:

  • DPO (Direct Preference Optimization): Bypasses reward model, optimizes directly on preference pairs

  • Loss: \(\mathcal{L} = -\mathbb{E}[\log \sigma(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)})]\)

  • Benefit: No reward model training, no value network, simpler pipeline

  • Trade-off: Less flexible (requires pairwise data), may underperform PPO on complex tasks

Q5: How do you prevent reward hacking in RLHF?

A:

  • KL penalty: Add \(-\beta \cdot \text{KL}(\pi_\theta || \pi_{\text{ref}})\) to reward to keep policy close to reference

  • Diverse reward models: Train ensemble and use min/average to avoid exploiting single model

  • Iterative refinement: Collect more human feedback on policy outputs, retrain reward model

  • Rule-based constraints: Hard penalties for undesired behaviors (toxicity, hallucination)

Q6: Why use LoRA for RLHF instead of full fine-tuning?

A:

  • Memory: RL stores 2 models (policy + reference), LoRA reduces footprint 10-100×

  • Sample efficiency: LoRA converges faster, enabling more experiments per GPU-hour

  • Multi-task: Can train separate LoRA adapters per task without reloading base model

  • Stability: Freezing base weights reduces catastrophic forgetting risk

Q7: What’s the three-step RLHF pipeline for ChatGPT/Claude?

A:

  1. SFT: Supervised fine-tune on high-quality demonstrations (instruction-following)

  2. Reward Modeling: Train reward model on human preference pairs (A vs B comparisons)

  3. PPO: Optimize policy to maximize reward while staying close to SFT model (KL penalty)

20 Chapter Summary

Core RL Concepts: Reinforcement learning frames LLM training as a Markov Decision Process where states are prompts/contexts, actions are token selections, and rewards come from human feedback. The policy \(\pi_\theta\) (the LLM itself) learns to maximize expected cumulative reward \(\mathbb{E}[\sum \gamma^t r_t]\) through iterative policy updates.

RLHF Pipeline: The standard three-step process starts with supervised fine-tuning on demonstrations, trains a reward model on human preference pairs using Bradley-Terry comparison, then optimizes the policy via PPO with KL-divergence regularization to prevent drift from the reference model. This produces aligned models (ChatGPT, Claude) that follow instructions while avoiding harmful outputs.

PPO Mechanics: Proximal Policy Optimization uses a clipped objective \(\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t, 1\pm\epsilon)\hat{A}_t)\) to limit policy updates, preventing catastrophic performance collapse. The value network \(V_\phi\) trains online to predict expected returns, providing a baseline for advantage estimation \(\hat{A}_t = R_t - V_\phi(s_t)\) that reduces gradient variance.

DPO Alternative: Direct Preference Optimization bypasses the reward model entirely, optimizing directly on preference pairs via a reparameterization trick. While simpler (no reward model, no value network), DPO requires pairwise comparison data and may underperform PPO on complex tasks requiring fine-grained reward shaping.

Practical Considerations: Production RLHF faces reward hacking (model exploits reward function gaps), sample inefficiency (requires many rollouts), and distribution shift (test behavior differs from training). Solutions include KL penalties to constrain exploration, ensemble reward models to prevent exploitation, LoRA for memory efficiency, and iterative human-in-the-loop refinement.

Key Takeaways:

  • RLHF enables optimizing non-differentiable objectives (safety, helpfulness) beyond supervised learning

  • Reward model quality determines alignment ceiling–invest in diverse, high-quality preference data

  • KL penalty \(\beta\) balances exploration vs stability: too high keeps the policy pinned to the reference (little improvement); too low lets it drift, risking reward hacking and mode collapse

  • Start with SFT for warm start, use LoRA for sample efficiency, monitor reward hacking continuously

  • DPO works well for simple alignment (harmlessness), PPO better for complex objectives (reasoning + safety)