5  Chapter 4: Optimization Algorithms for Deep Learning

6 Introduction: The Optimization Landscape

Training neural networks requires iteratively updating parameters \(\theta\) to minimize loss \(\mathcal{L}(\theta)\). The choice of optimizer profoundly affects:

  • Convergence speed: How quickly loss decreases

  • Final performance: Generalization to test data

  • Training stability: Avoiding divergence, especially at scale

  • Memory overhead: Per-parameter state storage

Evolution:

  1. SGD (1951): Plain stochastic gradient descent; momentum was added later (Polyak, 1964)

  2. Adaptive Learning Rates (2011-2012): AdaGrad and RMSProp scale gradients per parameter

  3. Adam Era (2015): Combines momentum with adaptive rates; became the default for deep learning

  4. Regularization Fixes (2017): AdamW decouples weight decay from gradient updates

  5. Modern Variants (2023-2024): Muon, Lion, and Sophia, optimized for LLM-scale training

Modern LLM Practice:

  • Pre-training: AdamW with \(\beta_1=0.9, \beta_2=0.95\) (GPT-3, LLaMA, Qwen)

  • Fine-tuning: AdamW or SGD with warmup + cosine decay

  • Emerging: Muon (momentum + orthogonalization) for memory efficiency

7 Stochastic Gradient Descent (SGD)

7.1 Vanilla SGD

Update Rule: \[\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)\] where \(\eta\) is the learning rate.

Properties:

  • + Minimal memory: no per-parameter state

  • + Well-understood theory

  • - Sensitive to learning rate choice

  • - Slow convergence, especially for ill-conditioned problems
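One vanilla SGD step is a single line of NumPy. The sketch below (illustrative names, toy quadratic loss \(\mathcal{L}(\theta) = \frac{1}{2}\|\theta\|^2\), whose gradient is \(\theta\) itself) drives the parameters toward the minimum at the origin:

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """One vanilla SGD update: theta <- theta - lr * grad."""
    return theta - lr * grad

# Toy quadratic L(theta) = 0.5 * ||theta||^2, so grad = theta.
theta = np.array([4.0, -2.0])
for _ in range(100):
    theta = sgd_step(theta, grad=theta, lr=0.1)
# Each step multiplied theta by (1 - lr) = 0.9, so it is now near zero.
```

Each step shrinks \(\theta\) geometrically by \((1 - \eta)\); this is also the simplest illustration of learning-rate sensitivity, since \(\eta \geq 2\) would diverge on this loss.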

7.2 SGD with Momentum

Accumulates velocity to smooth updates and accelerate in consistent directions:

\[\begin{align} v_{t+1} & = \beta v_t + \nabla_\theta \mathcal{L}(\theta_t) \\ \theta_{t+1} & = \theta_t - \eta v_{t+1} \end{align}\]

where \(\eta\) is the learning rate (step size), \(\beta\) is the momentum coefficient (typically 0.9), and \(v_t\) is the velocity (accumulated update direction).

Common choice: \(\beta = 0.9\) (retains 90% of previous velocity)

Intuition:

  • Acts like a ball rolling downhill: it accumulates momentum in consistent directions

  • Dampens oscillations in high-curvature directions

  • Can overshoot minima but often reaches better generalization
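A minimal sketch of the momentum update on an ill-conditioned quadratic (the 10:1 curvature ratio plays the role of a narrow valley; names are illustrative):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    """SGD with momentum: accumulate velocity, then step along it."""
    v = beta * v + grad        # v_{t+1} = beta * v_t + grad
    theta = theta - lr * v     # theta_{t+1} = theta_t - lr * v_{t+1}
    return theta, v

# Ill-conditioned quadratic L = 0.5 * (10*x^2 + y^2): curvatures 10 and 1.
curv = np.array([10.0, 1.0])
theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, v = momentum_step(theta, v, grad=curv * theta)
```

The velocity acts as a low-pass filter on gradients: oscillating components along the high-curvature axis partially cancel, while the consistent direction accumulates.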

7.3 Nesterov Accelerated Gradient (NAG)

Looks ahead before computing gradient: \[\begin{align} v_{t+1} & = \beta v_t + \nabla_\theta \mathcal{L}(\theta_t - \eta \beta v_t) \\ \theta_{t+1} & = \theta_t - \eta v_{t+1} \end{align}\]

Advantage: Better correction when approaching minima (gradient computed at “lookahead” position).
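The lookahead can be sketched by evaluating the gradient at \(\theta_t - \eta\beta v_t\) before updating the velocity (a toy sketch; `grad_fn` and the scalar quadratic are illustrative):

```python
import numpy as np

def nag_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """Nesterov: evaluate the gradient at the lookahead point."""
    lookahead = theta - lr * beta * v      # where momentum is about to take us
    v = beta * v + grad_fn(lookahead)      # gradient at the lookahead, not theta_t
    theta = theta - lr * v
    return theta, v

# Toy 1-D quadratic L = 0.5 * theta^2, so the gradient function is the identity.
theta, v = np.array([3.0]), np.zeros(1)
for _ in range(100):
    theta, v = nag_step(theta, v, grad_fn=lambda p: p)
```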

8 Adaptive Learning Rate Methods

8.1 AdaGrad (2011)

Adapts learning rate per parameter based on historical gradient magnitudes.

Update Rule: \[\begin{align} G_t & = G_{t-1} + \nabla_\theta \mathcal{L}(\theta_t) \odot \nabla_\theta \mathcal{L}(\theta_t) \quad \text{(accumulate squared gradients)} \\ \theta_{t+1} & = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot \nabla_\theta \mathcal{L}(\theta_t) \end{align}\]

where \(\odot\) is element-wise product, \(\epsilon \approx 10^{-8}\) for numerical stability.

Intuition:

  • Parameters with large cumulative gradients get smaller learning rates

  • Parameters with small cumulative gradients get larger learning rates

  • Useful for sparse features (NLP, recommendation systems)

Problem: The learning rate decreases monotonically, so learning can stall prematurely in deep networks.
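A sketch of the AdaGrad update (illustrative names). One consequence worth noticing: on the very first step \(G_t = g^2\), so every coordinate moves by roughly \(\eta \cdot \text{sign}(g)\), regardless of its gradient scale:

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.5, eps=1e-8):
    """AdaGrad: accumulate squared gradients; per-parameter LR shrinks forever."""
    G = G + grad * grad
    theta = theta - lr / np.sqrt(G + eps) * grad
    return theta, G

theta, G = np.array([1.0, 1.0]), np.zeros(2)
grad = np.array([10.0, 0.1])          # very different gradient scales
theta, G = adagrad_step(theta, G, grad)
# Both coordinates moved by ~lr = 0.5: the scaling equalized the step sizes.
```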

8.2 RMSProp (2012)

Fixes AdaGrad’s monotonic decay by using exponential moving average of squared gradients.

Update Rule: \[\begin{align} v_t & = \beta v_{t-1} + (1-\beta) \nabla_\theta \mathcal{L}(\theta_t) \odot \nabla_\theta \mathcal{L}(\theta_t) \\ \theta_{t+1} & = \theta_t - \frac{\eta}{\sqrt{v_t + \epsilon}} \odot \nabla_\theta \mathcal{L}(\theta_t) \end{align}\]

Common choice: \(\beta = 0.9\) (decay rate for moving average)

Advantage: Discounts old gradients, so the learning rate adapts to recent gradient history rather than the entire history.
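The same sketch with the exponential moving average replacing AdaGrad's running sum (illustrative names):

```python
import numpy as np

def rmsprop_step(theta, v, grad, lr=0.01, beta=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients."""
    v = beta * v + (1 - beta) * grad * grad
    theta = theta - lr / np.sqrt(v + eps) * grad
    return theta, v

# Unlike AdaGrad's running sum, v forgets old gradients, so the effective
# step size tracks recent gradient magnitudes instead of decaying forever.
theta, v = np.array([1.0]), np.zeros(1)
for _ in range(50):
    theta, v = rmsprop_step(theta, v, grad=theta)   # gradient of 0.5 * theta^2
```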

9 Adam: Adaptive Moment Estimation (2015)

Adam combines momentum (first moment) with RMSProp’s adaptive learning rates (second moment).

9.1 Adam Algorithm

Hyperparameters:

  • \(\eta\): learning rate (often \(10^{-3}\) to \(10^{-4}\))

  • \(\beta_1 = 0.9\): exponential decay for first moment (momentum)

  • \(\beta_2 = 0.999\): exponential decay for second moment (adaptive rate)

  • \(\epsilon = 10^{-8}\): numerical stability

Update Rule: \[\begin{align} m_t & = \beta_1 m_{t-1} + (1-\beta_1) \nabla_\theta \mathcal{L}(\theta_t) \quad \text{(first moment, momentum)} \\ v_t & = \beta_2 v_{t-1} + (1-\beta_2) \nabla_\theta \mathcal{L}(\theta_t)^2 \quad \text{(second moment, adaptive rate)} \\ \hat{m}_t & = \frac{m_t}{1 - \beta_1^t} \quad \text{(bias correction for first moment)} \\ \hat{v}_t & = \frac{v_t}{1 - \beta_2^t} \quad \text{(bias correction for second moment)} \\ \theta_{t+1} & = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t \end{align}\]

Note

Why Bias Correction?

Since \(m_0 = v_0 = 0\), early estimates are biased toward zero. Dividing by \((1 - \beta^t)\) compensates:

  • At \(t=1\): \(1 - \beta_1^1 = 0.1\), so \(\hat{m}_1 = m_1 / 0.1 = 10 m_1\) (amplify small initial moment)

  • As \(t \to \infty\): \(\beta_1^t \to 0\), correction factor \(\to 1\) (no correction needed)
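A single Adam step, including bias correction, can be sketched as follows (illustrative names). With \(m_0 = v_0 = 0\), the first corrected step has magnitude close to \(\eta\) whatever the gradient scale:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias correction (t is 1-indexed)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)          # undo the bias toward zero
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# First step with a huge gradient: the update magnitude is still ~lr,
# because m_hat / sqrt(v_hat) normalizes the gradient scale away.
theta, m, v = np.zeros(1), np.zeros(1), np.zeros(1)
theta, m, v = adam_step(theta, m, v, grad=np.array([100.0]), t=1)
```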

9.2 Adam’s Strengths and Weaknesses

Strengths:

  • + Works well out-of-the-box for wide range of problems

  • + Robust to hyperparameter choices (default \(\beta_1, \beta_2\) often sufficient)

  • + Fast initial convergence

  • + Handles sparse gradients well

Weaknesses:

  • - Can converge to worse solutions than SGD+momentum in some cases (generalization gap)

  • - Weight decay implementation was incorrect in original paper (fixed in AdamW)

  • - Memory: requires 2\(\times\) parameters for \(m_t, v_t\) storage

10 AdamW: Adam with Decoupled Weight Decay (2017)

10.1 The Weight Decay Problem in Adam

Terminology Clarification:

  • L2 regularization: Add \(\frac{\lambda}{2}\|\theta\|^2\) penalty to loss function

  • Weight decay: Directly shrink parameters by factor \((1 - \lambda)\) each step

  • For SGD, these are equivalent. For adaptive optimizers (Adam), they are not!

L2 Regularization Approach:

Add penalty to loss: \[\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \frac{\lambda}{2} \|\theta\|^2\]

Taking gradient (applies to all parameters including weights \(W\) and biases \(b\)): \[\nabla_\theta \mathcal{L}_{\text{reg}} = \nabla_\theta \mathcal{L} + \lambda \theta\]

Problem in Adam: This \(\lambda \theta\) term goes through adaptive scaling by \(\frac{1}{\sqrt{\hat{v}_t}}\), which weakens the regularization effect for parameters with large gradients.

Example:

  • Parameter \(\theta_1\) has large gradients \(\Rightarrow\) large \(\hat{v}_t\) \(\Rightarrow\) \(\lambda\theta_1\) gets divided by large \(\sqrt{\hat{v}_t}\) \(\Rightarrow\) weak regularization

  • Parameter \(\theta_2\) has small gradients \(\Rightarrow\) small \(\hat{v}_t\) \(\Rightarrow\) \(\lambda\theta_2\) gets divided by small \(\sqrt{\hat{v}_t}\) \(\Rightarrow\) strong regularization

  • Inconsistent regularization strength across parameters!

10.2 AdamW Solution: Decouple Weight Decay

Instead of adding \(\lambda \theta\) to gradient (L2 regularization), apply weight decay directly to parameters:

\[\begin{align} m_t & = \beta_1 m_{t-1} + (1-\beta_1) \nabla_\theta \mathcal{L}(\theta_t) \\ v_t & = \beta_2 v_{t-1} + (1-\beta_2) \nabla_\theta \mathcal{L}(\theta_t)^2 \\ \hat{m}_t & = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\ \theta_{t+1} & = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right) \end{align}\]

Key Difference: \(\lambda \theta_t\) is added after adaptive scaling (not before), ensuring consistent regularization strength across all parameters.

Equivalently (clearer form): \[\begin{align} \theta_{t+1} & = (1 - \eta\lambda) \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{align}\]

This shows weight decay as direct parameter shrinkage by factor \((1 - \eta\lambda)\), independent of gradient magnitude.
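One AdamW step can be sketched as below (illustrative names, with LLM-style hyperparameters as defaults). With a zero gradient the parameter still shrinks by exactly \((1 - \eta\lambda)\), confirming that the decay never passes through the adaptive scaling:

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, lr=6e-4, b1=0.9, b2=0.95,
               eps=1e-8, wd=0.1):
    """AdamW: weight decay applied to theta directly, outside adaptive scaling."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Equivalent to: theta <- (1 - lr*wd) * theta - lr * m_hat / (sqrt(v_hat) + eps)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

# Zero gradient: the parameter still shrinks by exactly (1 - lr * wd).
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
theta, m, v = adamw_step(theta, m, v, grad=np.zeros(1), t=1)
```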

Note

Why “Weight Decay” vs “L2 Regularization”?

  • L2 regularization: Penalty on loss \(\Rightarrow\) affects gradients \(\Rightarrow\) goes through optimizer’s adaptive scaling

  • Weight decay: Direct parameter shrinkage \(\Rightarrow\) bypasses optimizer \(\Rightarrow\) uniform regularization

  • For SGD: These are equivalent (no adaptive scaling)

  • For Adam/AdamW: Weight decay is the correct approach

10.3 AdamW in Modern LLMs

Standard Configuration (GPT-3, LLaMA, Qwen):

  • \(\eta = 6 \times 10^{-4}\) (peak learning rate after warmup)

  • \(\beta_1 = 0.9\)

  • \(\beta_2 = 0.95\) (lower than default 0.999 for better stability at scale)

  • \(\lambda = 0.1\) (weight decay)

  • Warmup: Linear increase over 2,000 steps

  • Decay: Cosine decay to 10% of peak LR

Why \(\beta_2 = 0.95\) for LLMs?

  • Lower \(\beta_2\) means faster adaptation to recent gradient changes

  • Helps with stability when training with very large batch sizes (millions of tokens)

  • Empirically better behaved in long training runs over trillions of tokens
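The adaptation-speed effect follows from the EMA's effective averaging window, which is roughly \(1/(1-\beta_2)\) steps:

```python
# The weight an EMA with decay beta gives a gradient from k steps ago is
# (1 - beta) * beta**k, so its effective averaging window is ~1/(1 - beta).
def ema_horizon(beta):
    return 1.0 / (1.0 - beta)

long_memory = ema_horizon(0.999)   # ~1000 steps of gradient history
short_memory = ema_horizon(0.95)   # ~20 steps: reacts fast to recent changes
```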

11 Muon: Momentum Orthogonalized by Newton-Schulz (2024)

Muon is a recent optimizer designed for memory-efficient LLM training.

11.1 Motivation

Problem with Adam/AdamW:

  • Requires storing \(m_t, v_t\) for every parameter (2\(\times\) memory overhead)

  • For 70B model with BF16: params (140GB) + optimizer state (280GB) = 420GB total

  • This limits training on memory-constrained hardware

11.2 Muon Algorithm

Key Ideas:

  1. Momentum (like Adam’s first moment): \(m_t = \beta m_{t-1} + (1-\beta) \nabla_\theta \mathcal{L}\)

  2. Newton-like scaling: Use approximate Hessian information via orthogonalization

  3. Memory efficiency: Store only \(m_t\) (not second moment \(v_t\))

Simplified Update: \[\begin{align} m_t & = \beta m_{t-1} + (1-\beta) \nabla_\theta \mathcal{L}(\theta_t) \\ m_t^{\text{orth}} & = \text{Orthogonalize}(m_t) \quad \text{(project away high-curvature directions)} \\ \theta_{t+1} & = \theta_t - \eta m_t^{\text{orth}} \end{align}\]
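One concrete reading of the orthogonalization step, matching the released Muon optimizer, replaces the (matrix-shaped) momentum with its nearest semi-orthogonal matrix. Muon approximates this map with a few Newton-Schulz iterations; the sketch below uses an exact SVD purely for clarity, and the function names are illustrative:

```python
import numpy as np

def orthogonalize(M):
    """Map a momentum matrix to its nearest semi-orthogonal matrix (U V^T).

    Muon approximates this with a few Newton-Schulz iterations; the exact
    SVD here is for clarity, not efficiency.
    """
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def muon_like_step(W, M, grad, lr=0.02, beta=0.95):
    """Muon-style update: only one momentum buffer M is stored per weight."""
    M = beta * M + (1 - beta) * grad
    W = W - lr * orthogonalize(M)
    return W, M

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))           # stand-in gradient for a weight matrix
O = orthogonalize(G)                  # every singular value of O equals 1
W, M = muon_like_step(np.zeros((4, 3)), np.zeros((4, 3)), grad=G)
```

Flattening the singular values equalizes the step size across directions, which is the sense in which the update is "Newton-like" without ever forming \(H\).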

Note

Orthogonalization: Geometry Without the Hessian

Core idea: Remove momentum components aligned with high-curvature directions to prevent oscillations in sharp valleys.

Assume local quadratic structure: \(L(\theta + \delta) \approx L(\theta) + g^\top \delta + \frac{1}{2}\delta^\top H \delta\). High-curvature eigenvectors of \(H\) cause instability. Newton uses \(\delta = H^{-1}g\), but computing \(H\) is prohibitive.

Orthogonalization approximates this by deflating sharp directions: \[m_t^{\text{orth}} = (I - U_k U_k^\top) m_t\] where \(U_k\) spans dominant curvature directions (estimated without \(H\)).

Practical implementations (cheapest to strongest):

  • Layer-wise normalization: \(m_\ell / \|m_\ell\|_2\) (per-layer trust); used in LARS, LAMB

  • RMS-based rescaling: \(D_t^{-1/2} m_t\) where \(D_t \approx g_t^2\); used implicitly in Adam

  • Blockwise QR: \(m_t - Q(Q^\top m_t)\) on small blocks; used in Shampoo, Muon

  • Power iteration on \(Hv\): find the top eigenvector via autodiff; Sophia-style

Why this works: Momentum aligns with the long-term descent direction but amplifies oscillations. Orthogonalization removes the high-curvature components, keeping motion along flat, generalizable directions; this acts like a partial Newton step combined with a trust region.

Relation to known optimizers:

  • SGD+momentum: Low-pass filter on gradients

  • Adam: Diagonal curvature whitening (via \(v_t\))

  • Shampoo: Blockwise Newton step

  • Muon: Momentum + orthogonalized updates (Adam without \(v_t\), with geometry-aware cleaning)

11.3 Advantages and Tradeoffs

Advantages:

  • + 50% less optimizer memory than AdamW (stores only \(m_t\), not \(v_t\))

  • + Comparable convergence to AdamW on LLM pre-training

  • + Better conditioning via orthogonalization (like Newton’s method)

Tradeoffs:

  • - More compute per step (orthogonalization overhead)

  • - Less mature; fewer established hyperparameter recipes than AdamW

  • - Not yet widely adopted in production systems

When to Consider Muon:

  • Training very large models (70B+) where optimizer memory is limiting

  • Hardware with limited HBM (e.g., older GPUs, edge training)

  • Willing to experiment with hyperparameters for potential gains

12 Comparison: Which Optimizer to Use?

Optimizer         Memory       Convergence  Use Case             Modern LLMs?
----------------  -----------  -----------  -------------------  ------------
SGD + Momentum    Low          Slow         CNNs, fine-tuning    Rare
AdaGrad           Medium       Moderate     Sparse features      No
RMSProp           Medium       Fast         RNNs, RL             Rare
Adam              High (2x)    Fast         General DL           Yes (legacy)
AdamW             High (2x)    Fast         Transformers, LLMs   Standard
Muon              Medium (1x)  Fast         Memory-constrained   Emerging

12.1 Practical Recommendations

Default Choice: AdamW with standard hyperparameters

  • \(\eta = 3 \times 10^{-4}\) (small models), \(6 \times 10^{-4}\) (large LLMs)

  • \(\beta_1 = 0.9\), \(\beta_2 = 0.95\) (LLMs) or \(0.999\) (smaller models)

  • Weight decay \(\lambda = 0.1\)

  • Warmup + cosine decay schedule

When to Use SGD + Momentum:

  • Computer vision (ResNets, ViTs) where SGD often generalizes better

  • Fine-tuning pre-trained models with small LR

  • When memory is extremely constrained

When to Experiment with Muon:

  • Training 70B+ models with limited GPU memory

  • Research settings exploring memory-efficient optimizers

13 Learning Rate Schedules

The learning rate \(\eta\) typically varies during training. Common schedules:

13.1 Warmup

Motivation: Large initial gradients can destabilize training. Warmup gradually increases LR from 0 to target.

Linear Warmup: \[\eta_t = \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}} \quad \text{for } t \leq T_{\text{warmup}}\]

Typical: 2,000-10,000 steps for LLMs

13.2 Cosine Decay

After warmup, decay learning rate following cosine curve: \[\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min}) \left(1 + \cos\left(\frac{t - T_{\text{warmup}}}{T_{\text{total}} - T_{\text{warmup}}} \pi\right)\right)\]

Common choice: \(\eta_{\min} = 0.1 \cdot \eta_{\max}\) (decay to 10% of peak)
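Warmup and cosine decay are usually combined into a single schedule function; a sketch with GPT-3-style constants as defaults (names are illustrative; `warmup` and `total` are assumed step counts):

```python
import math

def lr_at(t, lr_max, warmup=2000, total=100_000, min_ratio=0.1):
    """Linear warmup to lr_max, then cosine decay to min_ratio * lr_max."""
    lr_min = min_ratio * lr_max
    if t <= warmup:
        return lr_max * t / warmup                 # linear ramp from 0
    progress = (t - warmup) / (total - warmup)     # goes 0 -> 1 after warmup
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

For example, `lr_at(1000, 6e-4)` is half the peak (mid-warmup), and `lr_at(100_000, 6e-4)` has decayed to `6e-5`, i.e. 10% of peak.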

13.3 Step Decay

Reduce LR by factor (e.g., 0.1) at fixed intervals: \[\eta_t = \eta_0 \cdot \gamma^{\lfloor t / T_{\text{step}} \rfloor}\]

Less common in modern LLM training (cosine decay preferred).

14 Interview Cheat Sheet

14.1 Quick Facts

SGD:

  • “SGD with momentum (\(\beta=0.9\)) remains highly competitive for vision; it often generalizes better than Adam.”

  • “Nesterov looks ahead before computing the gradient, giving better correction near minima.”

Adam/AdamW:

  • “Adam combines momentum (\(\beta_1=0.9\)) with adaptive per-parameter learning rates (\(\beta_2=0.999\)).”

  • “AdamW fixes weight decay: applies \(\lambda \theta\) after adaptive scaling, not before.”

  • “Modern LLMs use AdamW with \(\beta_2=0.95\) (not 0.999) for stability at scale.”

  • “Memory cost: 2\(\times\) parameters for storing \(m_t\) and \(v_t\) states.”

AdaGrad/RMSProp:

  • “AdaGrad accumulates all past squared gradients; the LR decays monotonically and can stop learning.”

  • “RMSProp fixes this with an exponential moving average that discounts old gradients.”

Muon:

  • “Muon uses momentum + Newton-Schulz orthogonalization and stores only \(m_t\) (not \(v_t\)): 50% less optimizer memory than AdamW.”

  • “Emerging for 70B+ models where optimizer memory is limiting.”

Schedules:

  • “Warmup (2K-10K steps) prevents early instability; cosine decay to 10% of peak LR over training.”

  • “GPT-3/LLaMA: peak LR \(6 \times 10^{-4}\) with warmup + cosine decay.”

14.2 When Asked in Interviews

“Why AdamW over Adam?”

  • Adam’s weight decay goes through adaptive scaling, which weakens the regularization

  • AdamW applies weight decay directly: \(\theta \gets \theta - \eta(\text{update} + \lambda \theta)\)

  • Better generalization, especially for transformers/LLMs

“Why does Adam need bias correction?”

  • Initialize \(m_0 = v_0 = 0\), so early estimates biased toward zero

  • Correction \((1 - \beta^t)^{-1}\) amplifies small initial values

  • Becomes negligible as \(t \to \infty\) (since \(\beta^t \to 0\))

“SGD vs Adam for vision models?”

  • SGD+momentum often better generalization on ImageNet

  • Adam faster convergence but can overfit

  • ViTs: can use either, but AdamW more common

“What’s \(\beta_2\) in modern LLMs?”

  • Default: 0.999 (slow adaptation)

  • LLMs: 0.95 (faster adaptation for stability with large batches)

  • Lower \(\beta_2\) = shorter memory of gradient variance

15 Summary

The Evolution of Optimizers:

  1. SGD (1951): Foundation; simple but effective once paired with momentum

  2. AdaGrad (2011): Per-parameter adaptive rates, but the LR decays too fast

  3. RMSProp (2012): Fixes AdaGrad with an exponential moving average

  4. Adam (2015): Combined momentum + adaptive rates; became the default

  5. AdamW (2017): Fixed weight decay; now standard for transformers/LLMs

  6. Muon (2024): Memory-efficient via orthogonalization; emerging for very large models

Modern Practice (2024):

  • LLM Pre-training: AdamW (\(\beta_1=0.9, \beta_2=0.95\), \(\lambda=0.1\), warmup + cosine)

  • Vision: SGD+momentum or AdamW depending on architecture

  • Fine-tuning: AdamW with lower LR (\(10^{-5}\) to \(10^{-6}\))

  • Memory-constrained: Muon (experimental)



For questions, corrections, or suggestions: peymanr@gmail.com