4 Chapter 3: Logistic & Softmax Regression
5 Logistic Regression: Model and Likelihood
5.1 Model Definition
The logistic regression model transforms a linear combination of features into a probability: \[z = w^\top x + b, \quad p = \sigma(z) = \frac{1}{1 + e^{-z}}\]
where \(\sigma\) is the sigmoid function.
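A minimal NumPy sketch of this forward pass, with illustrative (not fitted) values for \(w\), \(b\), and \(x\):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)) maps any real logit to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters and input (arbitrary, not from a fitted model)
w = np.array([1.0, -2.0])
b = 0.5
x = np.array([0.3, 0.1])

z = w @ x + b          # linear combination: 0.6
p = sigmoid(z)         # probability of the positive class
print(round(p, 4))     # 0.6457
```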
5.2 Bernoulli Likelihood
For binary classification with label \(y \in \{0, 1\}\): \[P(y \mid x) = p^y (1-p)^{1-y}\]
5.3 Loss Function
The negative log-likelihood (binary cross entropy): \[\ell = -\left[ y \log p + (1-y)\log(1-p) \right]\]
Intuition: When \(y=1\), we minimize \(-\log p\) (push \(p \to 1\)). When \(y=0\), we minimize \(-\log(1-p)\) (push \(p \to 0\)).
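The intuition is easy to see numerically; a small sketch of the loss at two illustrative predictions:

```python
import numpy as np

def bce(y, p):
    # Negative Bernoulli log-likelihood: -[y log p + (1-y) log(1-p)]
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(round(bce(1, 0.9), 4))  # 0.1054: confident correct prediction, low loss
print(round(bce(1, 0.1), 4))  # 2.3026: confident wrong prediction, high loss
```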
6 Logistic Regression Gradient
6.1 Derivation
Chain rule computation: \[\frac{\partial \ell}{\partial z} = p - y\] \[\frac{\partial z}{\partial w} = x\]
Final gradient: \[\boxed{ \frac{\partial \ell}{\partial w} = (p - y)x }\]
Key Property: The gradient has the simple form \((p - y)x\): the difference between prediction and label, scaled by the input.
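The closed form \((p - y)x\) is easy to verify against central finite differences; a self-contained check on random illustrative data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    p = sigmoid(w @ x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, x, y):
    return (sigmoid(w @ x) - y) * x   # closed form (p - y) x

rng = np.random.default_rng(0)
w, x, y = rng.normal(size=3), rng.normal(size=3), 1.0
eps = 1e-6
numerical = np.array([(loss(w + eps * e, x, y) - loss(w - eps * e, x, y)) / (2 * eps)
                      for e in np.eye(3)])
print(np.allclose(numerical, grad(w, x, y), atol=1e-6))  # True
```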
7 Logistic Regression Hessian
7.1 Second Derivative
\[\frac{\partial^2 \ell}{\partial z^2} = p(1-p)\]
7.2 Hessian with Respect to Weights
Lifting to weight space: \[\boxed{ \nabla^2_{w} \ell = p(1-p)\, x x^\top }\]
Positive Semidefiniteness: Since \(p(1-p) > 0\) for \(p \in (0,1)\) and \(xx^\top\) is positive semidefinite, the Hessian is positive semidefinite \(\Rightarrow\) the optimization problem is convex.
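A quick numerical confirmation that \(p(1-p)\,xx^\top\) has no negative eigenvalues, on random illustrative data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
w, x = rng.normal(size=4), rng.normal(size=4)
p = sigmoid(w @ x)
H = p * (1 - p) * np.outer(x, x)     # Hessian p(1-p) x x^T

print(np.all(np.linalg.eigvalsh(H) >= -1e-12))  # True: all eigenvalues >= 0
```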
8 Softmax Regression
8.1 Model Definition
For multi-class classification with \(K\) classes: \[z_k = w_k^\top x, \quad p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}\]
where \(p_k\) is the probability of class \(k\).
8.2 Cross-Entropy Loss
\[\ell = -\sum_k y_k \log p_k\]
where \(y\) is the one-hot encoded label vector.
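A short sketch of softmax plus cross-entropy on illustrative logits; the max-shift inside softmax is a standard numerical-stability trick, not part of the math above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift by max for numerical stability
    return e / e.sum()

def cross_entropy(y_onehot, p):
    return -np.sum(y_onehot * np.log(p))

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
y = np.array([1.0, 0.0, 0.0])    # true class is 0

print(round(p.sum(), 6))              # 1.0: a valid distribution
print(round(cross_entropy(y, p), 4))  # equals -log p_0
```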
8.3 Gradient
\[\boxed{ \frac{\partial \ell}{\partial w_k} = (p_k - y_k)x }\]
Parallel to Logistic: Same form as binary logistic regression – difference between predicted probability and true label.
9 Softmax Hessian
9.1 Hessian in Logit Space
\[\boxed{ H_z = \mathrm{diag}(p) - p p^\top }\]
9.2 Hessian in Weight Space
Lifted to weights using Kronecker product: \[\boxed{ \nabla^2_{W} \ell = (\mathrm{diag}(p) - pp^\top) \otimes (xx^\top) }\]
Structure: \(\mathrm{diag}(p) - pp^\top\) is positive semidefinite (Jacobian of softmax), ensuring convexity.
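Both claims (that \(\mathrm{diag}(p) - pp^\top\) is PSD, and that it matches the Jacobian of the softmax) can be checked numerically on illustrative logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])
p = softmax(z)
H = np.diag(p) - np.outer(p, p)

# PSD: no negative eigenvalues
print(np.all(np.linalg.eigvalsh(H) >= -1e-12))  # True

# H equals the Jacobian dp/dz, estimated by central finite differences
eps = 1e-6
J = np.array([(softmax(z + eps * e) - softmax(z - eps * e)) / (2 * eps)
              for e in np.eye(3)]).T
print(np.allclose(J, H, atol=1e-6))  # True
```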
10 Fisher Information
10.1 Definition
The Fisher information matrix measures the curvature of the log-likelihood.
For a parametric model \(p(y \mid \theta)\), the Fisher information matrix is: \[\boxed{ \mathcal{I}(\theta) = \mathbb{E}_{y \sim p(y|\theta)} \left[ \nabla_\theta \log p(y|\theta) \, \nabla_\theta \log p(y|\theta)^\top \right] }\]
Equivalently (using the identity \(\mathbb{E}[\nabla_\theta \log p(y|\theta)] = 0\)): \[\mathcal{I}(\theta) = -\mathbb{E}_{y \sim p(y|\theta)} \left[ \nabla^2_\theta \log p(y|\theta) \right]\]
For softmax classification with logits \(z\):
Log-likelihood gradient: \[\nabla_z \log p(y|z) = y - p\]
Taking the expectation of the outer product of the score over \(y\), and using \(\mathbb{E}[y] = p\) and \(\mathbb{E}[yy^\top] = \mathrm{diag}(p)\) (since \(y\) is one-hot): \[\mathbb{E}\left[(y - p)(y - p)^\top\right] = \mathbb{E}[yy^\top] - pp^\top\]
Thus: \[\boxed{ \mathcal{I}(z) = \mathrm{diag}(p) - pp^\top }\]
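The expectation can be evaluated exactly by summing the outer products of the score \(y - p\) over the \(K\) classes, weighted by \(p_k\); a three-class sketch with illustrative logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 0.0, -0.5])
p = softmax(z)

# Exact Fisher: sum_k p_k * score_k score_k^T, where score_k = e_k - p
I = sum(p[k] * np.outer(np.eye(3)[k] - p, np.eye(3)[k] - p) for k in range(3))
print(np.allclose(I, np.diag(p) - np.outer(p, p)))  # True
```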
For exponential-family models in their canonical parameterization (including logistic/softmax in logit space), the Fisher information matrix equals the Hessian of the negative log-likelihood: \[\mathcal{I}(\theta) = H(\theta)\]
11 Natural Gradient vs. Newton
11.1 Update Rules
Natural gradient: \[\Delta\theta_{\text{nat}} = - \mathcal{I}(\theta)^{-1} \nabla_\theta L\]
Newton method: \[\Delta\theta_{\text{Newton}} = -H^{-1}\nabla_\theta L\]
11.2 Equivalence in Exponential Families
In exponential-family models: \[H = \mathcal{I} \quad \Rightarrow \quad \text{natural gradient = Newton step}\]
Implication: Natural gradient provides geometrically invariant updates that coincide with second-order optimization in exponential families.
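For binary logistic regression in logit space, the coincidence is easy to verify directly, with an illustrative \(p\) and label:

```python
# Check Fisher == Hessian for binary logistic regression in logit space
p = 0.7                                                   # model probability sigma(z)
fisher = (1 - p) * (p - 0.0) ** 2 + p * (p - 1.0) ** 2    # E[(p - y)^2], y ~ Bernoulli(p)
hessian = p * (1 - p)                                     # d^2 ell / dz^2
g = p - 1.0                                               # gradient for observed y = 1

print(abs(fisher - hessian) < 1e-12)               # True: I(z) = H(z)
print(abs(-g / fisher - (-g / hessian)) < 1e-12)   # True: natural gradient = Newton step
```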
12 KL Divergence Preview for RL
12.1 Local Quadratic Approximation
The KL divergence has a local quadratic form: \[\mathrm{KL}(p_\theta \;\|\; p_{\theta + d\theta}) = \frac{1}{2} d\theta^\top \mathcal{I} d\theta + O(\|d\theta\|^3)\]
Connection to RL:
KL acts as a global divergence measuring distribution mismatch
Fisher information \(\mathcal{I}\) acts as a local metric (Riemannian metric)
Used in TRPO/PPO to constrain policy updates
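A numerical check of the local quadratic approximation for a three-class softmax, with illustrative logits and a small perturbation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

z = np.array([0.3, -0.2, 0.8])
dz = 1e-3 * np.array([1.0, -2.0, 0.5])   # small perturbation of the logits
p, q = softmax(z), softmax(z + dz)

I = np.diag(p) - np.outer(p, p)          # Fisher information at z
quad = 0.5 * dz @ I @ dz                 # quadratic approximation of KL

print(abs(kl(p, q) - quad) / quad < 1e-2)  # True: agreement up to O(||dz||^3)
```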
13 Bradley-Terry Model: Learning from Preferences
13.1 Motivation: Ranking Without Explicit Labels
The Bradley-Terry (BT) model addresses a fundamental problem: How do we learn quality scores for items when we only observe pairwise preferences, not explicit labels?
Key Insight: Rather than requiring explicit labels \(y \in \{0,1\}\), we observe comparison outcomes: "Item \(i\) is preferred over item \(j\)." From these preferences, we infer latent quality scores via logistic regression.
13.2 Model Formulation
Assign each item \(i\) a skill parameter or quality score \(\theta_i \in \mathbb{R}\).
The probability that item \(i\) beats item \(j\) in a pairwise comparison is: \[P(i \succ j \mid \theta) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}} = \frac{1}{1 + e^{-(\theta_i - \theta_j)}} = \sigma(\theta_i - \theta_j)\]
This is logistic regression in disguise: the score difference \(z = \theta_i - \theta_j\) is the logit, and we model the win probability via the sigmoid function.
13.3 Connection to Logistic Regression
Feature vector: \(x = e_i - e_j\) (indicator: +1 for item \(i\), -1 for item \(j\))
Linear model: \(z = \theta^\top x = \theta_i - \theta_j\)
Label: \(y = 1\) if \(i\) wins, \(y = 0\) if \(j\) wins
Likelihood: Standard binary logistic regression
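The reduction to logistic regression in one short sketch, with illustrative skill scores:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([1.2, 0.4, -0.3])    # illustrative skill scores for items 0, 1, 2
i, j = 0, 2
x = np.eye(3)[i] - np.eye(3)[j]       # feature vector e_i - e_j

# P(i beats j) = sigma(theta^T x) = sigma(theta_i - theta_j)
print(round(sigmoid(theta @ x), 4))   # 0.8176
```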
Example: Chess Players
Suppose we observe:
Player A beats Player B (3 times)
Player B beats Player C (2 times)
Player A beats Player C (4 times)
We estimate skills \(\theta_A, \theta_B, \theta_C\) by maximizing: \[\mathcal{L}(\theta) = 3 \log P(A \succ B) + 2 \log P(B \succ C) + 4 \log P(A \succ C)\]
This is logistic regression with no explicit labels – just pairwise preference data!
13.4 Maximum Likelihood Estimation
Given \(n\) pairwise comparisons \((i_k \succ j_k)\) for \(k=1,\ldots,n\), the log-likelihood is: \[\log \mathcal{L}(\theta) = \sum_{k=1}^n \log \sigma(\theta_{i_k} - \theta_{j_k})\]
13.5 Gradient Derivation
Taking the derivative with respect to \(\theta_i\), we identify terms where \(\theta_i\) appears:
Case 1: When \(i\) wins (comparison \(k\) where \(i_k = i\)): \[\frac{\partial}{\partial \theta_i} \log \sigma(\theta_i - \theta_{j_k}) = 1 - \sigma(\theta_i - \theta_{j_k})\]
Case 2: When \(i\) loses (comparison \(k\) where \(j_k = i\)): \[\frac{\partial}{\partial \theta_i} \log \sigma(\theta_{i_k} - \theta_i) = -[1 - \sigma(\theta_{i_k} - \theta_i)] = \sigma(\theta_{i_k} - \theta_i) - 1\]
Combining both cases and applying the identity \(\sigma(z) - 1 = -\sigma(-z)\) to the loss terms: \[\begin{align*} \frac{\partial \log \mathcal{L}}{\partial \theta_i} & = \sum_{k: i_k = i} [1 - \sigma(\theta_i - \theta_{j_k})] + \sum_{k: j_k = i} [\sigma(\theta_{i_k} - \theta_i) - 1] \\ & = \sum_{k: i_k = i} [1 - \sigma(\theta_i - \theta_{j_k})] - \sum_{k: j_k = i} \sigma(\theta_i - \theta_{i_k}) \end{align*}\] The final result: \[\boxed{\frac{\partial \log \mathcal{L}}{\partial \theta_i} = \sum_{k: i_k = i} \left(1 - \sigma(\theta_i - \theta_{j_k})\right) - \sum_{k: j_k = i} \sigma(\theta_i - \theta_{i_k})}\]
Interpretation: This is the familiar observed-minus-expected form from logistic regression, here observed wins minus expected wins:
First sum: games where \(i\) won, weighted by \((1 - p_{\text{win}})\) = how surprising each win was
Second sum: games where \(i\) lost, weighted by \(p_{\text{win}}\) = how surprising each loss was
At optimum (gradient = 0), expected performance matches observed performance
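The gradient above can be turned into a small fitter for the chess example; the learning rate and iteration count here are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# (winner, loser, count) from the chess example above: A=0, B=1, C=2
games = [(0, 1, 3), (1, 2, 2), (0, 2, 4)]
theta = np.zeros(3)

for _ in range(500):
    g = np.zeros(3)
    for win, lose, n in games:
        p = sigmoid(theta[win] - theta[lose])   # predicted P(winner beats loser)
        g[win] += n * (1 - p)                   # winner: observed - expected
        g[lose] -= n * (1 - p)                  # loser: symmetric decrement
    theta += 0.1 * g                            # gradient ascent on log-likelihood
    theta -= theta.mean()                       # pin the sum-to-zero gauge (identifiability)

print(theta[0] > theta[1] > theta[2])  # True: A strongest, C weakest
```

Since every observed matchup here is one-directional, the unregularized MLE slowly diverges; the ordering is still recovered after a finite number of steps.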
13.6 Elo Rating System
The Elo rating system (used in chess, competitive games) is an online approximation of the Bradley-Terry model:
After player \(i\) plays player \(j\) with outcome \(s_{ij} \in \{0, 1\}\) (win/loss): \[\theta_i^{\text{new}} = \theta_i^{\text{old}} + K \cdot \left(s_{ij} - P(i \succ j)\right)\] where \(K\) is the learning rate (typically 16–32), and \(P(i \succ j) = \sigma(\theta_i - \theta_j)\).
Elo = Stochastic Gradient Descent on Bradley-Terry:
Each game is one training example
Update is gradient step: \((s_{ij} - p_{ij})\) matches \((\text{observed} - \text{predicted})\)
Elo provides online learning without storing full comparison history
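A sketch of the update; note that real Elo uses a base-10 logistic on a 400-point scale, while this version keeps the natural-log sigmoid of the Bradley-Terry model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elo_update(theta_i, theta_j, s_ij, K=32.0):
    # One online Bradley-Terry SGD step. (Real Elo uses a base-10 logistic
    # with a 400-point scale; we keep the natural-log sigmoid for simplicity.)
    p = sigmoid(theta_i - theta_j)     # predicted P(i beats j)
    delta = K * (s_ij - p)             # observed minus predicted
    return theta_i + delta, theta_j - delta

a, b = 0.0, 0.0
a, b = elo_update(a, b, s_ij=1.0)      # A beats B, starting from equal ratings
print(a, b)  # 16.0 -16.0
```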
13.7 Connection to RLHF and Reward Modeling
The Bradley-Terry model is the foundation for Reinforcement Learning from Human Feedback (RLHF):
Collect preferences: Humans compare two LLM outputs: "Response A is better than Response B"
Fit reward model: Use Bradley-Terry to estimate quality scores \(r_\theta(x, y)\) for prompt-response pairs
Optimize policy: Train the LLM to maximize the predicted reward via RL (e.g., PPO), or optimize the preference objective directly (DPO)
RLHF Reward Modeling:
Given prompt \(x\) and two responses \(y_1, y_2\) with preference \(y_1 \succ y_2\), we model: \[P(y_1 \succ y_2 \mid x) = \sigma(r_\theta(x, y_1) - r_\theta(x, y_2))\]
where \(r_\theta\) is a neural network (reward model). Training minimizes: \[\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_1, y_2)} \left[\log \sigma(r_\theta(x, y_1) - r_\theta(x, y_2))\right]\]
This is Bradley-Terry logistic regression with learned features \(r_\theta\)!
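A scalar sketch of this reward-model loss, with toy reward values standing in for the network outputs \(r_\theta(x, y_1)\) and \(r_\theta(x, y_2)\):

```python
import numpy as np

def bt_preference_loss(r_chosen, r_rejected):
    # -log sigma(r_chosen - r_rejected), written as log(1 + e^{-d});
    # stable when the margin d is not a large negative number
    d = r_chosen - r_rejected
    return np.log1p(np.exp(-d))

print(round(bt_preference_loss(2.0, 0.5), 4))  # 0.2014: preference respected, low loss
print(round(bt_preference_loss(0.5, 2.0), 4))  # 1.7014: preference violated, high loss
```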
13.8 Key Properties
No explicit labels: Learn from relative comparisons, not absolute scores
Probabilistic: Captures uncertainty in preferences (upsets can happen)
Transitive but noisy: If \(\theta_A > \theta_B > \theta_C\), then \(P(A \succ C)\) is high but not 1
Identifiability: Scores are unique up to a constant (can fix \(\theta_1 = 0\) or \(\sum_i \theta_i = 0\))