5 Chapter 4: Optimization Algorithms for Deep Learning
6 Introduction: The Optimization Landscape
Training neural networks requires iteratively updating parameters \(\theta\) to minimize loss \(\mathcal{L}(\theta)\). The choice of optimizer profoundly affects:
Convergence speed: How quickly loss decreases
Final performance: Generalization to test data
Training stability: Avoiding divergence, especially at scale
Memory overhead: Per-parameter state storage
Evolution:
SGD (1951): Simple stochastic gradient descent; momentum was added later (Polyak, 1964)
Adaptive Learning Rates (2011-2014): AdaGrad and RMSProp scale gradients per parameter
Adam Era (2015): Combines momentum with adaptive rates; became the default for deep learning
Regularization Fixes (2017): AdamW decouples weight decay from gradient updates
Modern Variants (2023-2024): Muon, Lion, and Sophia, optimized for LLM-scale training
Modern LLM Practice:
Pre-training: AdamW with \(\beta_1=0.9, \beta_2=0.95\) (GPT-3, LLaMA, Qwen)
Fine-tuning: AdamW or SGD with warmup + cosine decay
Emerging: Muon (momentum + orthogonalization) for memory efficiency
7 Stochastic Gradient Descent (SGD)
7.1 Vanilla SGD
Update Rule: \[\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)\] where \(\eta\) is the learning rate.
Properties:
+ Minimal memory: no per-parameter state
+ Well-understood theory
- Sensitive to learning rate choice
- Slow convergence, especially for ill-conditioned problems
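The update rule can be sketched in a few lines; the quadratic loss, starting point, and learning rate below are illustrative choices, not from the text:

```python
def sgd_step(theta, grad, lr):
    """Vanilla SGD: theta <- theta - lr * grad."""
    return theta - lr * grad

# Illustrative: minimize L(theta) = theta^2, whose gradient is 2*theta.
theta = 5.0
for _ in range(100):
    theta = sgd_step(theta, grad=2 * theta, lr=0.1)
# Each step contracts theta by the factor (1 - 2*lr) = 0.8 toward the minimum at 0.
```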
7.2 SGD with Momentum
Accumulates velocity to smooth updates and accelerate in consistent directions:
\[\begin{align} v_{t+1} & = \beta v_t + \nabla_\theta \mathcal{L}(\theta_t) \\ \theta_{t+1} & = \theta_t - \eta v_{t+1} \end{align}\]
where \(\eta\) is the learning rate (step size), \(\beta\) is the momentum coefficient (typically 0.9), and \(v_t\) is the velocity (accumulated update direction).
Common choice: \(\beta = 0.9\) (retains 90% of previous velocity)
Intuition:
Acts like a ball rolling downhill, accumulating momentum in consistent directions
Dampens oscillations in high-curvature directions
Can overshoot minima but often reaches better generalization
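A minimal sketch of the velocity update above, again on an illustrative 1-D quadratic:

```python
def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    """SGD with momentum: accumulate velocity, then step along it."""
    v = beta * v + grad      # v_{t+1} = beta * v_t + grad
    theta = theta - lr * v   # theta_{t+1} = theta_t - lr * v_{t+1}
    return theta, v

# Illustrative: minimize L(theta) = theta^2. The iterate overshoots and
# oscillates before settling, matching the rolling-ball intuition.
theta, v = 5.0, 0.0
for _ in range(500):
    theta, v = momentum_step(theta, v, grad=2 * theta)
```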
7.3 Nesterov Accelerated Gradient (NAG)
Looks ahead before computing gradient: \[\begin{align} v_{t+1} & = \beta v_t + \nabla_\theta \mathcal{L}(\theta_t - \eta \beta v_t) \\ \theta_{t+1} & = \theta_t - \eta v_{t+1} \end{align}\]
Advantage: Better correction when approaching minima (gradient computed at “lookahead” position).
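A sketch of the lookahead evaluation; `grad_fn` is a stand-in for \(\nabla_\theta \mathcal{L}\), and the quadratic demo is illustrative:

```python
def nag_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """Nesterov momentum: evaluate the gradient at the lookahead point."""
    lookahead = theta - lr * beta * v   # where momentum alone would carry us
    v = beta * v + grad_fn(lookahead)   # gradient at the lookahead position
    return theta - lr * v, v

# Illustrative quadratic L(theta) = theta^2 again.
theta, v = 5.0, 0.0
for _ in range(300):
    theta, v = nag_step(theta, v, grad_fn=lambda t: 2 * t)
```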
8 Adaptive Learning Rate Methods
8.1 AdaGrad (2011)
Adapts learning rate per parameter based on historical gradient magnitudes.
Update Rule: \[\begin{align} G_t & = G_{t-1} + \nabla_\theta \mathcal{L}(\theta_t) \odot \nabla_\theta \mathcal{L}(\theta_t) \quad \text{(accumulate squared gradients)} \\ \theta_{t+1} & = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot \nabla_\theta \mathcal{L}(\theta_t) \end{align}\]
where \(\odot\) is element-wise product, \(\epsilon \approx 10^{-8}\) for numerical stability.
Intuition:
Parameters with large cumulative gradients get smaller learning rates
Parameters with small cumulative gradients get larger learning rates
Useful for sparse features (NLP, recommendation systems)
Problem: The learning rate decreases monotonically, so learning can stop prematurely in deep networks.
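A sketch of the accumulation above (base learning rate illustrative); it also makes the monotonic-decay problem visible, since \(G_t\) only ever grows:

```python
def adagrad_step(theta, G, grad, lr=0.5, eps=1e-8):
    """AdaGrad: accumulate squared gradients; effective LR is lr / sqrt(G)."""
    G = G + grad * grad
    theta = theta - lr / (G + eps) ** 0.5 * grad
    return theta, G

# Illustrative 1-D run on L(theta) = theta^2: the effective learning
# rate lr / sqrt(G) shrinks at every step and never recovers.
theta, G = 5.0, 0.0
for _ in range(50):
    theta, G = adagrad_step(theta, G, grad=2 * theta)
```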
8.2 RMSProp (2012)
Fixes AdaGrad’s monotonic decay by using exponential moving average of squared gradients.
Update Rule: \[\begin{align} v_t & = \beta v_{t-1} + (1-\beta) \nabla_\theta \mathcal{L}(\theta_t) \odot \nabla_\theta \mathcal{L}(\theta_t) \\ \theta_{t+1} & = \theta_t - \frac{\eta}{\sqrt{v_t + \epsilon}} \odot \nabla_\theta \mathcal{L}(\theta_t) \end{align}\]
Common choice: \(\beta = 0.9\) (decay rate for moving average)
Advantage: Discounts old gradients, so the learning rate adapts to recent gradient history rather than the entire history.
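The same sketch with the exponential moving average swapped in (hyperparameters illustrative):

```python
def rmsprop_step(theta, v, grad, lr=0.01, beta=0.9, eps=1e-8):
    """RMSProp: EMA of squared gradients instead of a full-history sum."""
    v = beta * v + (1 - beta) * grad * grad
    theta = theta - lr / (v + eps) ** 0.5 * grad
    return theta, v

# Illustrative 1-D run: because sqrt(v) tracks |grad|, each step has
# magnitude roughly lr, independent of the gradient's scale.
theta, v = 5.0, 0.0
for _ in range(1000):
    theta, v = rmsprop_step(theta, v, grad=2 * theta)
```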
9 Adam: Adaptive Moment Estimation (2015)
Adam combines momentum (first moment) with RMSProp’s adaptive learning rates (second moment).
9.1 Adam Algorithm
Hyperparameters:
\(\eta\): learning rate (often \(10^{-3}\) to \(10^{-4}\))
\(\beta_1 = 0.9\): exponential decay for first moment (momentum)
\(\beta_2 = 0.999\): exponential decay for second moment (adaptive rate)
\(\epsilon = 10^{-8}\): numerical stability
Update Rule: \[\begin{align} m_t & = \beta_1 m_{t-1} + (1-\beta_1) \nabla_\theta \mathcal{L}(\theta_t) \quad \text{(first moment, momentum)} \\ v_t & = \beta_2 v_{t-1} + (1-\beta_2) \nabla_\theta \mathcal{L}(\theta_t)^2 \quad \text{(second moment, adaptive rate)} \\ \hat{m}_t & = \frac{m_t}{1 - \beta_1^t} \quad \text{(bias correction for first moment)} \\ \hat{v}_t & = \frac{v_t}{1 - \beta_2^t} \quad \text{(bias correction for second moment)} \\ \theta_{t+1} & = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t \end{align}\]
Why Bias Correction?
Since \(m_0 = v_0 = 0\), early estimates are biased toward zero. Dividing by \((1 - \beta^t)\) compensates:
At \(t=1\): \(1 - \beta_1^1 = 0.1\), so \(\hat{m}_1 = m_1 / 0.1 = 10\, m_1\); since \(m_1 = 0.1\, \nabla_\theta \mathcal{L}\), this recovers the full gradient exactly
As \(t \to \infty\): \(\beta_1^t \to 0\), correction factor \(\to 1\) (no correction needed)
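The equations above translate line by line into code (the gradient value in the demo is illustrative):

```python
def adam_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the step count, starting at 1."""
    m = b1 * m + (1 - b1) * grad           # first moment (momentum)
    v = b2 * v + (1 - b2) * grad * grad    # second moment (adaptive rate)
    m_hat = m / (1 - b1 ** t)              # bias correction: at t=1, m_hat == grad
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

theta, m, v = adam_step(5.0, 0.0, 0.0, grad=3.0, t=1)
```

In the scalar case \(\hat{m}_1 / \sqrt{\hat{v}_1} = g_1 / |g_1|\), so the very first step has magnitude close to \(\eta\) regardless of the gradient's scale, one reason Adam works well out of the box.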
9.2 Adam’s Strengths and Weaknesses
Strengths:
+ Works well out-of-the-box for wide range of problems
+ Robust to hyperparameter choices (default \(\beta_1, \beta_2\) often sufficient)
+ Fast initial convergence
+ Handles sparse gradients well
Weaknesses:
- Can converge to worse solutions than SGD+momentum in some cases (generalization gap)
- Weight decay implementation was incorrect in original paper (fixed in AdamW)
- Memory: requires 2\(\times\) parameters for \(m_t, v_t\) storage
10 AdamW: Adam with Decoupled Weight Decay (2017)
10.1 The Weight Decay Problem in Adam
Terminology Clarification:
L2 regularization: Add \(\frac{\lambda}{2}\|\theta\|^2\) penalty to loss function
Weight decay: Directly shrink parameters by factor \((1 - \lambda)\) each step
For SGD, these are equivalent. For adaptive optimizers (Adam), they are not!
L2 Regularization Approach:
Add penalty to loss: \[\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \frac{\lambda}{2} \|\theta\|^2\]
Taking gradient (applies to all parameters including weights \(W\) and biases \(b\)): \[\nabla_\theta \mathcal{L}_{\text{reg}} = \nabla_\theta \mathcal{L} + \lambda \theta\]
Problem in Adam: This \(\lambda \theta\) term goes through adaptive scaling by \(\frac{1}{\sqrt{\hat{v}_t}}\), which weakens the regularization effect for parameters with large gradients.
Example:
Parameter \(\theta_1\) has large gradients \(\Rightarrow\) large \(\hat{v}_t\) \(\Rightarrow\) \(\lambda\theta_1\) gets divided by large \(\sqrt{\hat{v}_t}\) \(\Rightarrow\) weak regularization
Parameter \(\theta_2\) has small gradients \(\Rightarrow\) small \(\hat{v}_t\) \(\Rightarrow\) \(\lambda\theta_2\) gets divided by small \(\sqrt{\hat{v}_t}\) \(\Rightarrow\) strong regularization
Inconsistent regularization strength across parameters!
10.2 AdamW Solution: Decouple Weight Decay
Instead of adding \(\lambda \theta\) to gradient (L2 regularization), apply weight decay directly to parameters:
\[\begin{align} m_t & = \beta_1 m_{t-1} + (1-\beta_1) \nabla_\theta \mathcal{L}(\theta_t) \\ v_t & = \beta_2 v_{t-1} + (1-\beta_2) \nabla_\theta \mathcal{L}(\theta_t)^2 \\ \hat{m}_t & = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\ \theta_{t+1} & = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right) \end{align}\]
Key Difference: \(\lambda \theta_t\) is added after adaptive scaling (not before), ensuring consistent regularization strength across all parameters.
Equivalently (clearer form): \[\begin{align} \theta_{t+1} & = (1 - \eta\lambda) \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{align}\]
This shows weight decay as direct parameter shrinkage by factor \((1 - \eta\lambda)\), independent of gradient magnitude.
Why “Weight Decay” vs “L2 Regularization”?
L2 regularization: Penalty on loss \(\Rightarrow\) affects gradients \(\Rightarrow\) goes through optimizer’s adaptive scaling
Weight decay: Direct parameter shrinkage \(\Rightarrow\) bypasses optimizer \(\Rightarrow\) uniform regularization
For SGD: These are equivalent (no adaptive scaling)
For Adam/AdamW: Weight decay is the correct approach
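The decoupled update differs from the Adam recursion by a single term; a minimal sketch:

```python
def adamw_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.1):
    """AdamW: weight decay is applied to theta directly, outside adaptive scaling."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # lambda * theta is added AFTER the adaptive scaling, not to the gradient.
    theta = theta - lr * (m_hat / (v_hat ** 0.5 + eps) + wd * theta)
    return theta, m, v

# With a zero gradient the update is pure shrinkage: theta * (1 - lr * wd).
theta, m, v = adamw_step(5.0, 0.0, 0.0, grad=0.0, t=1)
```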
10.3 AdamW in Modern LLMs
Standard Configuration (GPT-3, LLaMA, Qwen):
\(\eta = 6 \times 10^{-4}\) (peak learning rate after warmup)
\(\beta_1 = 0.9\)
\(\beta_2 = 0.95\) (lower than default 0.999 for better stability at scale)
\(\lambda = 0.1\) (weight decay)
Warmup: Linear increase over 2,000 steps
Decay: Cosine decay to 10% of peak LR
Why \(\beta_2 = 0.95\) for LLMs?
Lower \(\beta_2\) means faster adaptation to recent gradient changes
Helps with stability when training with very large batch sizes (millions of tokens)
Empirically better for long training runs over trillions of tokens
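In PyTorch, this configuration might be wired up roughly as follows; the model, step counts, and schedule wiring are placeholders to verify against your own training code:

```python
import math
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the actual transformer

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,             # peak LR, reached after warmup
    betas=(0.9, 0.95),   # beta2 = 0.95 for stability at scale
    eps=1e-8,
    weight_decay=0.1,
)

warmup_steps, total_steps = 2_000, 100_000  # illustrative horizon

def lr_multiplier(step):
    """Linear warmup, then cosine decay to 10% of the peak LR."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.1 + 0.45 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_multiplier)
```

In the training loop, `optimizer.step()` is followed by `scheduler.step()` once per optimization step.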
11 Muon: Momentum Orthogonalized by Newton-Schulz (2024)
Muon is a recent optimizer designed for memory-efficient LLM training.
11.1 Motivation
Problem with Adam/AdamW:
Requires storing \(m_t, v_t\) for every parameter (2\(\times\) memory overhead)
For 70B model with BF16: params (140GB) + optimizer state (280GB) = 420GB total
This limits training on memory-constrained hardware
11.2 Muon Algorithm
Key Ideas:
Momentum (like Adam’s first moment): \(m_t = \beta m_{t-1} + (1-\beta) \nabla_\theta \mathcal{L}\)
Newton-like scaling: Use approximate Hessian information via orthogonalization
Memory efficiency: Store only \(m_t\) (not second moment \(v_t\))
Simplified Update: \[\begin{align} m_t & = \beta m_{t-1} + (1-\beta) \nabla_\theta \mathcal{L}(\theta_t) \\ m_t^{\text{orth}} & = \text{Orthogonalize}(m_t) \quad \text{(project away high-curvature directions)} \\ \theta_{t+1} & = \theta_t - \eta m_t^{\text{orth}} \end{align}\]
Orthogonalization: Geometry Without the Hessian
Core idea: Remove momentum components aligned with high-curvature directions to prevent oscillations in sharp valleys.
Assume local quadratic structure: \(L(\theta + \delta) \approx L(\theta) + g^\top \delta + \frac{1}{2}\delta^\top H \delta\). High-curvature eigenvectors of \(H\) cause instability. Newton uses \(\delta = H^{-1}g\), but computing \(H\) is prohibitive.
Orthogonalization approximates this by deflating sharp directions: \[m_t^{\text{orth}} = (I - U_k U_k^\top) m_t\] where \(U_k\) spans dominant curvature directions (estimated without \(H\)).
Practical implementations (cheapest to strongest):
| Method | What it does | Used in |
|---|---|---|
| Layer-wise normalization | \(m_\ell / \|m_\ell\|_2\) (per-layer trust) | LAMB, LARS |
| RMS-based rescaling | \(D_t^{-1/2} m_t\) where \(D_t \approx g_t^2\) | Adam (implicitly) |
| Blockwise QR | \(m_t - Q(Q^\top m_t)\) on small blocks | Shampoo, Muon |
| Power iteration on \(Hv\) | Find top eigenvector via autodiff | Sophia-style |
Why this works: Momentum aligns with long-term descent but amplifies oscillations. Orthogonalization removes the high-curvature components, keeping motion along flat, generalizable directions, which is equivalent to a partial Newton step plus a trust region.
Relation to known optimizers:
SGD+momentum: Low-pass filter on gradients
Adam: Diagonal curvature whitening (via \(v_t\))
Shampoo: Blockwise Newton step
Muon: Momentum + subspace deflation (Adam without \(v_t\), geometry-aware cleaning)
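The deflation formula \(m^{\text{orth}} = (I - U_k U_k^\top) m\) is easy to sketch. Here \(U_k\) is a hypothetical set of orthonormal high-curvature directions; in practice it would be estimated, e.g. via blockwise QR or power iteration as in the table above:

```python
import numpy as np

def deflate(m, U):
    """Subspace deflation: remove the components of momentum m lying in
    the span of U's orthonormal columns: m_orth = (I - U U^T) m."""
    return m - U @ (U.T @ m)

rng = np.random.default_rng(0)
m = rng.normal(size=8)                        # a momentum vector (illustrative)
U, _ = np.linalg.qr(rng.normal(size=(8, 2)))  # hypothetical curvature directions
m_orth = deflate(m, U)
# m_orth has no component along U, and its norm can only shrink.
```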
11.3 Advantages and Tradeoffs
Advantages:
+ 50% less optimizer memory than AdamW (stores only \(m_t\), not \(v_t\))
+ Comparable convergence to AdamW on LLM pre-training
+ Better conditioning via orthogonalization (like Newton’s method)
Tradeoffs:
- More compute per step (orthogonalization overhead)
- Less mature, with fewer established hyperparameter recipes than AdamW
- Not yet widely adopted in production systems
When to Consider Muon:
Training very large models (70B+) where optimizer memory is limiting
Hardware with limited HBM (e.g., older GPUs, edge training)
Willing to experiment with hyperparameters for potential gains
12 Comparison: Which Optimizer to Use?
| Optimizer | Memory | Convergence | Use Case | Modern LLMs? |
|---|---|---|---|---|
| SGD + Momentum | Low | Slow | CNNs, fine-tuning | Rare |
| AdaGrad | Medium | Moderate | Sparse features | No |
| RMSProp | Medium | Fast | RNNs, RL | Rare |
| Adam | High (2\(\times\)) | Fast | General DL | Yes (legacy) |
| AdamW | High (2\(\times\)) | Fast | Transformers, LLMs | Standard |
| Muon | Medium (1\(\times\)) | Fast | Memory-constrained | Emerging |
12.1 Practical Recommendations
Default Choice: AdamW with standard hyperparameters
\(\eta = 3 \times 10^{-4}\) (small models), \(6 \times 10^{-4}\) (large LLMs)
\(\beta_1 = 0.9\), \(\beta_2 = 0.95\) (LLMs) or \(0.999\) (smaller models)
Weight decay \(\lambda = 0.1\)
Warmup + cosine decay schedule
When to Use SGD + Momentum:
Computer vision (ResNets, ViTs) where SGD often generalizes better
Fine-tuning pre-trained models with small LR
When memory is extremely constrained
When to Experiment with Muon:
Training 70B+ models with limited GPU memory
Research settings exploring memory-efficient optimizers
13 Learning Rate Schedules
The learning rate \(\eta\) typically varies during training. Common schedules:
13.1 Warmup
Motivation: Large initial gradients can destabilize training. Warmup gradually increases LR from 0 to target.
Linear Warmup: \[\eta_t = \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}} \quad \text{for } t \leq T_{\text{warmup}}\]
Typical: 2,000-10,000 steps for LLMs
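As code (peak LR and warmup length are illustrative defaults):

```python
def warmup_lr(t, lr_max=6e-4, t_warmup=2000):
    """Linear warmup: ramp from 0 to lr_max over t_warmup steps, then hold."""
    return lr_max * min(t, t_warmup) / t_warmup
```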
13.2 Cosine Decay
After warmup, decay learning rate following cosine curve: \[\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min}) \left(1 + \cos\left(\frac{t - T_{\text{warmup}}}{T_{\text{total}} - T_{\text{warmup}}} \pi\right)\right)\]
Common choice: \(\eta_{\min} = 0.1 \cdot \eta_{\max}\) (decay to 10% of peak)
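A direct transcription of the formula, for \(t \geq T_{\text{warmup}}\) (all default step counts illustrative):

```python
import math

def cosine_lr(t, lr_max=6e-4, lr_min=6e-5, t_warmup=2000, t_total=100_000):
    """Cosine decay from lr_max down to lr_min between t_warmup and t_total."""
    progress = (t - t_warmup) / (t_total - t_warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```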
13.3 Step Decay
Reduce LR by factor (e.g., 0.1) at fixed intervals: \[\eta_t = \eta_0 \cdot \gamma^{\lfloor t / T_{\text{step}} \rfloor}\]
Less common in modern LLM training (cosine decay preferred).
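And the step schedule (defaults illustrative):

```python
def step_decay_lr(t, lr0=1e-3, gamma=0.1, t_step=30):
    """Multiply the LR by gamma after every t_step steps."""
    return lr0 * gamma ** (t // t_step)
```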
14 Interview Cheat Sheet
14.1 Quick Facts
SGD:
“SGD with momentum (\(\beta=0.9\)) is still strong for vision, often generalizing better than Adam.”
“Nesterov looks ahead before computing the gradient, giving better correction near minima.”
Adam/AdamW:
“Adam combines momentum (\(\beta_1=0.9\)) with adaptive per-parameter learning rates (\(\beta_2=0.999\)).”
“AdamW fixes weight decay: applies \(\lambda \theta\) after adaptive scaling, not before.”
“Modern LLMs use AdamW with \(\beta_2=0.95\) (not 0.999) for stability at scale.”
“Memory cost: 2\(\times\) parameters for storing \(m_t\) and \(v_t\) states.”
AdaGrad/RMSProp:
“AdaGrad accumulates all past squared gradients, so the LR decays monotonically and learning can stop.”
“RMSProp fixes this with an exponential moving average that discounts old gradients.”
Muon:
“Muon uses momentum plus Newton-like orthogonalization and stores only \(m_t\) (not \(v_t\)), so it needs 50% less optimizer memory than AdamW.”
“Emerging for 70B+ models where optimizer memory is limiting.”
Schedules:
“Warmup (2K-10K steps) prevents early instability; cosine decay to 10% LR over training.”
“GPT-3/LLaMA: peak LR \(6 \times 10^{-4}\) with warmup + cosine decay.”
14.2 When Asked in Interviews
“Why AdamW over Adam?”
Adam’s weight decay (implemented as L2 regularization) goes through adaptive scaling, which weakens the regularization
AdamW applies weight decay directly: \(\theta \gets \theta - \eta(\text{update} + \lambda \theta)\)
Better generalization, especially for transformers/LLMs
“Why does Adam need bias correction?”
Initialize \(m_0 = v_0 = 0\), so early estimates biased toward zero
Correction \((1 - \beta^t)^{-1}\) amplifies small initial values
Becomes negligible as \(t \to \infty\) (since \(\beta^t \to 0\))
“SGD vs Adam for vision models?”
SGD+momentum often better generalization on ImageNet
Adam faster convergence but can overfit
ViTs: can use either, but AdamW more common
“What’s \(\beta_2\) in modern LLMs?”
Default: 0.999 (slow adaptation)
LLMs: 0.95 (faster adaptation for stability with large batches)
Lower \(\beta_2\) = shorter memory of gradient variance
15 Summary
The Evolution of Optimizers:
SGD (1951): Foundation; simple but effective once paired with momentum
AdaGrad (2011): Per-parameter adaptive rates, but LR decays too fast
RMSProp (2012): Fixes AdaGrad with exponential moving average
Adam (2015): Combined momentum with adaptive rates; became the default
AdamW (2017): Fixed weight decay; the standard for transformers/LLMs
Muon (2024): Memory-efficient via orthogonalization; emerging for very large models
Modern Practice (2024):
LLM Pre-training: AdamW (\(\beta_1=0.9, \beta_2=0.95\), \(\lambda=0.1\), warmup + cosine)
Vision: SGD+momentum or AdamW depending on architecture
Fine-tuning: AdamW with lower LR (\(10^{-5}\) to \(10^{-6}\))
Memory-constrained: Muon (experimental)
For questions, corrections, or suggestions: peymanr@gmail.com