4  Chapter 3: Logistic & Softmax Regression

5 Logistic Regression: Model and Likelihood

5.1 Model Definition

The logistic regression model transforms a linear combination of features into a probability: \[z = w^\top x + b, \quad p = \sigma(z) = \frac{1}{1 + e^{-z}}\]

where \(\sigma\) is the sigmoid function.
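As a minimal sketch, the two equations above translate directly into code (function and variable names here are illustrative, not from the text):

```python
import math

def predict_proba(w, b, x):
    """Logistic regression forward pass: p = sigma(w^T x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear score (logit)
    return 1.0 / (1.0 + math.exp(-z))             # sigmoid squashes to (0, 1)
```

At \(z = 0\) the model is maximally uncertain: \(\sigma(0) = 1/2\).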

5.2 Bernoulli Likelihood

For binary classification with label \(y \in \{0, 1\}\): \[P(y \mid x) = p^y (1-p)^{1-y}\]

5.3 Loss Function

The negative log-likelihood (binary cross entropy): \[\ell = -\left[ y \log p + (1-y)\log(1-p) \right]\]

Note

Intuition: When \(y=1\), we minimize \(-\log p\) (push \(p \to 1\)). When \(y=0\), we minimize \(-\log(1-p)\) (push \(p \to 0\)).
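This intuition can be made concrete with a tiny sketch of the loss (the helper name is illustrative):

```python
import math

def bce(p, y):
    """Binary cross entropy for one example with prediction p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
```

For \(y = 1\) the loss falls as \(p\) rises toward 1; for \(y = 0\) it falls as \(p\) drops toward 0.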

6 Logistic Regression Gradient

6.1 Derivation

Chain rule computation: \[\frac{\partial \ell}{\partial z} = p - y\] \[\frac{\partial z}{\partial w} = x\]

Final gradient: \[\boxed{ \frac{\partial \ell}{\partial w} = (p - y)x }\]

Note

Key Property: The gradient has the simple form \((p - y)x\) – the difference between prediction and label, scaled by the input.
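The closed-form gradient \((p - y)x\) can be verified against central finite differences of the loss; the sketch below (illustrative names, no bias term) does exactly that:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    """Binary cross entropy of sigma(w^T x) against label y."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def grad(w, x, y):
    """Closed-form gradient (p - y) x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]
```

Perturbing each coordinate of `w` by a small `eps` and differencing `loss` should reproduce each coordinate of `grad` to within the finite-difference error.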

7 Logistic Regression Hessian

7.1 Second Derivative

\[\frac{\partial^2 \ell}{\partial z^2} = p(1-p)\]

7.2 Hessian with Respect to Weights

Lifting to weight space: \[\boxed{ \nabla^2_{w} \ell = p(1-p)\, x x^\top }\]

Note

Positive Semidefiniteness: Since \(p(1-p) > 0\) for \(p \in (0,1)\) and \(xx^\top\) is positive semidefinite, the Hessian is positive semidefinite \(\Rightarrow\) the loss is convex in \(w\).
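The boxed Hessian can likewise be checked by differencing the gradient (same illustrative setup as before, no bias term); note that the Hessian does not depend on the label \(y\):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad(w, x, y):
    """Gradient (p - y) x of the binary cross entropy."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]

def hessian(w, x):
    """Closed-form Hessian p(1 - p) x x^T (independent of the label y)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [[p * (1 - p) * xi * xj for xj in x] for xi in x]
```

Differencing `grad` column by column recovers each column of `hessian`, and \(v^\top H v \ge 0\) for any \(v\) confirms positive semidefiniteness.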

8 Softmax Regression

8.1 Model Definition

For multi-class classification with \(K\) classes: \[z_k = w_k^\top x, \quad p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}\]

where \(p_k\) is the probability of class \(k\).
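In code, the softmax is usually computed with the maximum subtracted from the logits first; this leaves the output unchanged (softmax is shift-invariant) but avoids overflow. This is a standard numerical trick, not specific to this text:

```python
import math

def softmax(z):
    """Map logits z to a probability vector over K classes."""
    m = max(z)                              # shift for numerical stability
    exps = [math.exp(zk - m) for zk in z]   # softmax is shift-invariant
    s = sum(exps)
    return [e / s for e in exps]
```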

8.2 Cross-Entropy Loss

\[\ell = -\sum_k y_k \log p_k\]

where \(y\) is the one-hot encoded label vector.

8.3 Gradient

\[\boxed{ \frac{\partial \ell}{\partial w_k} = (p_k - y_k)x }\]

Note

Parallel to Logistic: Same form as binary logistic regression – difference between predicted probability and true label.
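The step behind the boxed result is the logit-space gradient \(\partial \ell / \partial z = p - y\), which finite differences confirm (illustrative sketch):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zk - m) for zk in z]
    s = sum(exps)
    return [e / s for e in exps]

def xent(z, y):
    """Cross entropy -sum_k y_k log p_k at logits z, with one-hot y."""
    p = softmax(z)
    return -sum(yk * math.log(pk) for yk, pk in zip(y, p))

def grad_z(z, y):
    """Closed-form logit gradient p - y."""
    return [pk - yk for pk, yk in zip(softmax(z), y)]
```

Since \(z_k = w_k^\top x\), the chain rule multiplies each component by \(x\), giving the boxed \((p_k - y_k)x\).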

9 Softmax Hessian

9.1 Hessian in Logit Space

\[\boxed{ H_z = \mathrm{diag}(p) - p p^\top }\]

9.2 Hessian in Weight Space

Lifted to weights using Kronecker product: \[\boxed{ \nabla^2_{W} \ell = (\mathrm{diag}(p) - pp^\top) \otimes (xx^\top) }\]

Note

Structure: \(\mathrm{diag}(p) - pp^\top\) is positive semidefinite (Jacobian of softmax), ensuring convexity.
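The matrix \(\mathrm{diag}(p) - pp^\top\) is also the Jacobian of the softmax map itself, which a finite-difference check confirms, along with positive semidefiniteness (illustrative sketch):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zk - m) for zk in z]
    s = sum(exps)
    return [e / s for e in exps]

def hessian_z(z):
    """H_z = diag(p) - p p^T at logits z."""
    p = softmax(z)
    K = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(K)]
            for i in range(K)]
```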

10 Fisher Information

10.1 Definition

The Fisher information matrix measures the curvature of the log-likelihood.

For a parametric model \(p(y \mid \theta)\), the Fisher information matrix is: \[\boxed{ \mathcal{I}(\theta) = \mathbb{E}_{y \sim p(y|\theta)} \left[ \nabla_\theta \log p(y|\theta) \, \nabla_\theta \log p(y|\theta)^\top \right] }\]

Equivalently (using the identity \(\mathbb{E}[\nabla_\theta \log p(y|\theta)] = 0\)): \[\mathcal{I}(\theta) = -\mathbb{E}_{y \sim p(y|\theta)} \left[ \nabla^2_\theta \log p(y|\theta) \right]\]

For softmax classification with logits \(z\):

Log-likelihood gradient: \[\nabla_z \log p(y|z) = y - p\]

Taking the expectation of the outer product of this score over \(y\) (one-hot, so \(\mathbb{E}[yy^\top] = \mathrm{diag}(p)\) and \(\mathbb{E}[y] = p\)): \[\mathbb{E}\left[(y - p)(y - p)^\top\right] = \mathbb{E}[yy^\top] - pp^\top\]

Thus: \[\boxed{ \mathcal{I}(z) = \mathrm{diag}(p) - pp^\top }\]

For exponential-family models in natural parameters (including logistic/softmax), the Hessian of the negative log-likelihood does not depend on the observed \(y\), so it coincides with its expectation and the Fisher information matrix equals the Hessian: \[\mathcal{I}(\theta) = H(\theta)\]
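Because \(y\) ranges over only \(K\) one-hot outcomes, the expectation defining \(\mathcal{I}(z)\) can be computed exactly by enumeration and compared with \(\mathrm{diag}(p) - pp^\top\) (sketch with illustrative names):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zk - m) for zk in z]
    s = sum(exps)
    return [e / s for e in exps]

def fisher_by_enumeration(z):
    """I(z) = E_y[(y - p)(y - p)^T], summing over the K one-hot outcomes."""
    p = softmax(z)
    K = len(p)
    I = [[0.0] * K for _ in range(K)]
    for k in range(K):                      # outcome y = e_k has probability p[k]
        score = [(1.0 if j == k else 0.0) - p[j] for j in range(K)]
        for a in range(K):
            for b in range(K):
                I[a][b] += p[k] * score[a] * score[b]
    return I
```

The result matches \(\mathrm{diag}(p) - pp^\top\) entry by entry.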

11 Natural Gradient vs. Newton

11.1 Update Rules

Natural gradient: \[\Delta\theta_{\text{nat}} = - \mathcal{I}(\theta)^{-1} \nabla_\theta L\]

Newton method: \[\Delta\theta_{\text{Newton}} = -H^{-1}\nabla_\theta L\]

11.2 Equivalence in Exponential Families

In exponential-family models: \[H = \mathcal{I} \quad \Rightarrow \quad \text{natural gradient = Newton step}\]

Note

Implication: Natural gradient provides geometrically invariant updates that coincide with second-order optimization in exponential families.

12 KL Divergence Preview for RL

12.1 Local Quadratic Approximation

The KL divergence has a local quadratic form: \[\mathrm{KL}(p_\theta \;\|\; p_{\theta + d\theta}) = \frac{1}{2} d\theta^\top \mathcal{I} d\theta + O(\|d\theta\|^3)\]

Note

Connection to RL:

  • KL acts as a global divergence measuring distribution mismatch

  • Fisher information \(\mathcal{I}\) acts as a local metric (Riemannian metric)

  • Used in TRPO/PPO to constrain policy updates
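The local quadratic form can be checked numerically for a categorical distribution parameterized by logits, where \(\mathcal{I} = \mathrm{diag}(p) - pp^\top\): because the remainder is cubic, shrinking the perturbation by 10x should shrink the approximation error by roughly 1000x. The sketch below uses illustrative names:

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zk - m) for zk in z]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def quad_form(p, d):
    """(1/2) d^T (diag(p) - p p^T) d, the local quadratic model of KL."""
    K = len(p)
    return 0.5 * sum(
        d[i] * ((p[i] if i == j else 0.0) - p[i] * p[j]) * d[j]
        for i in range(K) for j in range(K)
    )

def kl_error(theta, d):
    """Gap between the true KL and its quadratic model."""
    p = softmax(theta)
    q = softmax([t + di for t, di in zip(theta, d)])
    return abs(kl(p, q) - quad_form(p, d))
```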

13 Bradley-Terry Model: Learning from Preferences

13.1 Motivation: Ranking Without Explicit Labels

The Bradley-Terry (BT) model addresses a fundamental problem: How do we learn quality scores for items when we only observe pairwise preferences, not explicit labels?

Note

Key Insight: Rather than requiring explicit labels \(y \in \{0,1\}\), we observe comparison outcomes: "Item \(i\) is preferred over item \(j\)." From these preferences, we infer latent quality scores via logistic regression.

13.2 Model Formulation

Assign each item \(i\) a skill parameter or quality score \(\theta_i \in \mathbb{R}\).

The probability that item \(i\) beats item \(j\) in a pairwise comparison is: \[P(i \succ j \mid \theta) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}} = \frac{1}{1 + e^{-(\theta_i - \theta_j)}} = \sigma(\theta_i - \theta_j)\]

This is logistic regression in disguise: the score difference \(z = \theta_i - \theta_j\) is the logit, and we model the win probability via the sigmoid function.

13.3 Connection to Logistic Regression

  • Feature vector: \(x = e_i - e_j\) (indicator: +1 for item \(i\), -1 for item \(j\))

  • Linear model: \(z = \theta^\top x = \theta_i - \theta_j\)

  • Label: \(y = 1\) if \(i\) wins, \(y = 0\) if \(j\) wins

  • Likelihood: Standard binary logistic regression
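The correspondence in the bullets above fits in a few lines of code (illustrative sketch): the direct Bradley-Terry probability and the logistic-regression view with feature vector \(x = e_i - e_j\) give the same answer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_prob(theta, i, j):
    """P(i beats j) = sigma(theta_i - theta_j) under Bradley-Terry."""
    return sigmoid(theta[i] - theta[j])

def bt_prob_as_logreg(theta, i, j):
    """Same probability via the feature vector x = e_i - e_j and z = theta^T x."""
    x = [0.0] * len(theta)
    x[i], x[j] = 1.0, -1.0
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)
```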

Tip

Example: Chess Players

Suppose we observe:

  • Player A beats Player B (3 times)

  • Player B beats Player C (2 times)

  • Player A beats Player C (4 times)

We estimate skills \(\theta_A, \theta_B, \theta_C\) by maximizing: \[\mathcal{L}(\theta) = 3 \log P(A \succ B) + 2 \log P(B \succ C) + 4 \log P(A \succ C)\]

This is logistic regression with no explicit labels – just pairwise preference data!
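This maximization can be sketched with plain gradient ascent (all names illustrative). One caveat: because the observed data contain no upsets, the unregularized likelihood keeps pushing the score gaps apart, so the sketch stops after a fixed number of steps and reads off the ordering:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# (winner, loser, count) triples from the chess example above
games = [("A", "B", 3), ("B", "C", 2), ("A", "C", 4)]
theta = {"A": 0.0, "B": 0.0, "C": 0.0}

lr = 0.1
for _ in range(500):
    grad = {k: 0.0 for k in theta}
    for win, lose, n in games:
        p = sigmoid(theta[win] - theta[lose])  # P(winner beats loser)
        grad[win] += n * (1 - p)               # winner's score pushed up
        grad[lose] -= n * (1 - p)              # loser's score pushed down
    for k in theta:
        theta[k] += lr * grad[k]
    mean = sum(theta.values()) / len(theta)    # identifiability: keep zero mean
    for k in theta:
        theta[k] -= mean
```

After fitting, the scores order the players A > B > C, matching the observed dominance.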

13.4 Maximum Likelihood Estimation

Given \(n\) pairwise comparisons \((i_k \succ j_k)\) for \(k=1,\ldots,n\), the log-likelihood is: \[\log \mathcal{L}(\theta) = \sum_{k=1}^n \log \sigma(\theta_{i_k} - \theta_{j_k})\]

13.5 Gradient Derivation

Taking the derivative with respect to \(\theta_i\), we identify terms where \(\theta_i\) appears:

Case 1: When \(i\) wins (comparison \(k\) where \(i_k = i\)): \[\frac{\partial}{\partial \theta_i} \log \sigma(\theta_i - \theta_{j_k}) = 1 - \sigma(\theta_i - \theta_{j_k})\]

Case 2: When \(i\) loses (comparison \(k\) where \(j_k = i\)): \[\frac{\partial}{\partial \theta_i} \log \sigma(\theta_{i_k} - \theta_i) = -[1 - \sigma(\theta_{i_k} - \theta_i)] = \sigma(\theta_{i_k} - \theta_i) - 1\]

Combining both cases, and using the identity \(1 - \sigma(z) = \sigma(-z)\) to simplify the loss terms: \[\begin{align*} \frac{\partial \log \mathcal{L}}{\partial \theta_i} & = \sum_{k: i_k = i} \left[1 - \sigma(\theta_i - \theta_{j_k})\right] + \sum_{k: j_k = i} \left[\sigma(\theta_{i_k} - \theta_i) - 1\right] \\ & = \sum_{k: i_k = i} \left[1 - \sigma(\theta_i - \theta_{j_k})\right] - \sum_{k: j_k = i} \sigma(\theta_i - \theta_{i_k}) \end{align*}\]

Each win contributes \(1 - P(i \text{ wins})\) and each loss contributes \(-P(i \text{ wins})\), so the gradient is the number of observed wins minus the expected number of wins over all of \(i\)'s games. The final result: \[\boxed{\frac{\partial \log \mathcal{L}}{\partial \theta_i} = \sum_{k: i_k = i} \left(1 - \sigma(\theta_i - \theta_{j_k})\right) - \sum_{k: j_k = i} \sigma(\theta_i - \theta_{i_k})}\]

Note

Interpretation: This is the same observed-minus-expected structure as the logistic regression gradient: observed wins minus expected wins.

  • First sum: games where \(i\) won, weighted by \((1 - p_{\text{win}})\) = how surprising each win was

  • Second sum: games where \(i\) lost, weighted by \(p_{\text{win}}\) = how surprising each loss was

  • At optimum (gradient = 0), expected performance matches observed performance
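The gradient of the log-likelihood can be sanity-checked against finite differences (illustrative names; each win of \(i\) enters with weight \(1 - \sigma(\theta_i - \theta_{\text{opp}})\), each loss with weight \(-\sigma(\theta_i - \theta_{\text{opp}})\)):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loglik(theta, games):
    """Sum of log sigma(theta_winner - theta_loser) over comparisons."""
    return sum(math.log(sigmoid(theta[w] - theta[l])) for w, l in games)

def grad_i(theta, games, i):
    """Wins weighted by (1 - p_win) minus losses weighted by p_win."""
    g = 0.0
    for w, l in games:
        if w == i:
            g += 1.0 - sigmoid(theta[i] - theta[l])
        elif l == i:
            g -= sigmoid(theta[i] - theta[w])
    return g
```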

13.6 Elo Rating System

The Elo rating system (used in chess and other competitive games) is an online approximation of the Bradley-Terry model:

After player \(i\) plays player \(j\) with outcome \(s_{ij} \in \{0, 1\}\) (win/loss): \[\theta_i^{\text{new}} = \theta_i^{\text{old}} + K \cdot \left(s_{ij} - P(i \succ j)\right)\] where \(K\) is the learning rate (typically 16–32), and \(P(i \succ j) = \sigma(\theta_i - \theta_j)\).

Note

Elo = Stochastic Gradient Descent on Bradley-Terry:

  • Each game is one training example

  • Update is gradient step: \((s_{ij} - p_{ij})\) matches \((\text{observed} - \text{predicted})\)

  • Elo provides online learning without storing full comparison history
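One Elo step can be sketched as follows, mirroring the update formula above with a symmetric change for the opponent. Note this follows the section's \(\sigma(\theta_i - \theta_j)\) notation with a small step size suited to natural-logit units; real chess implementations use a rescaled logistic and correspondingly larger K values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def elo_update(theta_i, theta_j, s_ij, K=0.1):
    """One online step: move each rating by K * (observed - predicted)."""
    p = sigmoid(theta_i - theta_j)      # P(i beats j) before the game
    delta = K * (s_ij - p)
    return theta_i + delta, theta_j - delta
```

Ratings are zero-sum per game, and an expected win moves the ratings less than an upset.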

13.7 Connection to RLHF and Reward Modeling

The Bradley-Terry model is the foundation for Reinforcement Learning from Human Feedback (RLHF):

  1. Collect preferences: Humans compare two LLM outputs: "Response A is better than Response B"

  2. Fit reward model: Use Bradley-Terry to estimate quality scores \(r_\theta(x, y)\) for prompt-response pairs

  3. Optimize policy: Train LLM to maximize predicted reward via RL (PPO, DPO)

Tip

RLHF Reward Modeling:

Given prompt \(x\) and two responses \(y_1, y_2\) with preference \(y_1 \succ y_2\), we model: \[P(y_1 \succ y_2 \mid x) = \sigma(r_\theta(x, y_1) - r_\theta(x, y_2))\]

where \(r_\theta\) is a neural network (reward model). Training minimizes: \[\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_1, y_2)} \left[\log \sigma(r_\theta(x, y_1) - r_\theta(x, y_2))\right]\]

This is Bradley-Terry logistic regression with learned features \(r_\theta\)!
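The preference loss can be sketched with a stand-in reward model; here a toy linear scorer replaces the neural network, and every name below is illustrative:

```python
import math

def reward(w, feats):
    """Toy linear stand-in for the neural reward model r_theta(x, y)."""
    return sum(wi * fi for wi, fi in zip(w, feats))

def preference_loss(w, feats_chosen, feats_rejected):
    """-log sigma(r(chosen) - r(rejected)): Bradley-Terry on learned scores."""
    margin = reward(w, feats_chosen) - reward(w, feats_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Driving this loss down widens the reward margin between preferred and rejected responses.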

13.8 Key Properties

  • No explicit labels: Learn from relative comparisons, not absolute scores

  • Probabilistic: Captures uncertainty in preferences (upsets can happen)

  • Transitive but noisy: If \(\theta_A > \theta_B > \theta_C\), then \(P(A \succ C)\) is high but not 1

  • Identifiability: Scores are unique up to a constant (can fix \(\theta_1 = 0\) or \(\sum_i \theta_i = 0\))