13 Chapter 12: Diffusion Models

14 The Unifying Story: DAE + VAE = Diffusion

14.1 The Core Narrative

Diffusion models emerged from the marriage of two seemingly disparate ideas:

Denoising Autoencoders (DAE): Learn to remove noise from data, capturing local geometry via the score function (\(\nabla_x \log p(x)\))
Variational Autoencoders (VAE): Model data by learning to decode from a global iid noise prior \(\mathcal{N}(0,I)\)

The fundamental tension:

DAE works with small noise → learns local structure but cannot sample globally
VAE assumes full noise → reaches iid prior easily but requires compression bottleneck, loses fidelity

Diffusion’s insight: Bridge the gap by gradually adding noise across many steps, making each step a well-conditioned local denoising problem, while collectively reaching full iid noise.

Note

One-Line Summary:

Diffusion models are VAEs whose decoder is implemented as a continuous family of denoising autoencoders across noise scales.

15 Denoising Autoencoders: The Foundation

15.1 Classical DAE (Vincent et al., 2008–2011)

Training objective: \[\mathcal{L}_{\text{DAE}} = \mathbb{E}_{x \sim p_{\text{data}},\, \epsilon \sim \mathcal{N}(0,\sigma^2 I)} \big[\|f_\theta(x + \epsilon) - x\|^2\big]\]

Procedure:

Corrupt data slightly: \(\tilde{x} = x + \epsilon\)
Learn to reconstruct: \(f_\theta(\tilde{x}) \approx x\)
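The two-step procedure above can be tried on a toy problem. The sketch below is only an illustration under stated assumptions: the data are 1D Gaussian, the denoiser is restricted to a linear map \(f(\tilde{x}) = a\tilde{x} + b\), and the function name `fit_linear_denoiser` is hypothetical. Under those assumptions the least-squares fit should approach the shrinkage factor \(a = s^2 / (s^2 + \sigma^2)\), where \(s\) is the data standard deviation.

```python
import random

def fit_linear_denoiser(data_mean, data_std, noise_std, n=20000, seed=0):
    """Least-squares fit of f(x_tilde) = a * x_tilde + b on corrupted samples.

    Minimising the squared DAE loss over linear maps is ordinary linear
    regression of the clean value x on the corrupted value x_tilde.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(data_mean, data_std) for _ in range(n)]   # clean data
    xts = [x + rng.gauss(0.0, noise_std) for x in xs]         # corrupted data
    mean_x = sum(xs) / n
    mean_xt = sum(xts) / n
    cov = sum((xt - mean_xt) * (x - mean_x) for xt, x in zip(xts, xs)) / n
    var = sum((xt - mean_xt) ** 2 for xt in xts) / n
    a = cov / var
    b = mean_x - a * mean_xt
    return a, b

# Toy setting (assumed values): data ~ N(3, 1), noise std 0.5.
a, b = fit_linear_denoiser(data_mean=3.0, data_std=1.0, noise_std=0.5)
expected_a = 1.0 / (1.0 + 0.5 ** 2)   # s^2 / (s^2 + sigma^2) = 0.8
print(a, b, expected_a)
```

The fitted denoiser pulls each corrupted point part of the way back toward the data mean, which is the 1D version of "reconstruct \(x\) from \(\tilde{x}\)".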

15.2 Connection to Score Functions

Key theoretical result (Vincent 2011):

As \(\sigma \to 0\), the optimal denoiser satisfies: \[f_\theta(x) - x \;\propto\; \nabla_x \log p(x)\]

Key insight: DAE learns the score (gradient of log density).

Note

What is the score function?

The score function \(\nabla_x \log p(x)\) is the gradient of the log probability density with respect to the data \(x\). It tells you which direction to move in data space to increase the probability.

Points "uphill" toward regions of higher probability density
Near a mode of the density (e.g., a typical real image): score is close to zero (already near a peak)
At low-probability regions (e.g., noise): score points strongly toward real data
The data score \(\nabla_x \log p_{\text{data}}(x)\) specifically refers to the score of the true data distribution

Why it matters: Denoising is equivalent to following this gradient to remove corruption – the denoiser learns to point from noisy samples back toward the data manifold.

Simplified proof intuition:

Consider corrupted data \(\tilde{x} = x + \epsilon\) where \(\epsilon \sim \mathcal{N}(0, \sigma^2 I)\). The optimal denoiser minimizes: \[\mathbb{E}_{p(x)} \mathbb{E}_{\epsilon}[\|f_\theta(\tilde{x}) - x\|^2]\]

Taking the gradient with respect to \(f_\theta(\tilde{x})\) and setting to zero: \[f_\theta(\tilde{x}) = \mathbb{E}[x | \tilde{x}]\]

By Tweedie’s formula, for Gaussian noise: \[\mathbb{E}[x | \tilde{x}] = \tilde{x} + \sigma^2 \nabla_{\tilde{x}} \log p(\tilde{x})\]

Therefore: \[f_\theta(\tilde{x}) - \tilde{x} = \sigma^2 \nabla_{\tilde{x}} \log p(\tilde{x})\]

Key takeaway: The denoising direction is exactly the score function \(\nabla \log p(\tilde{x})\) scaled by \(\sigma^2\). As \(\sigma \to 0\), the corrupted distribution converges to the clean data distribution: \(p(\tilde{x}) \to p_{\text{data}}(x)\). Therefore, denoising directly estimates the data score \(\nabla_x \log p_{\text{data}}(x)\) – the gradient of the log probability of the true data distribution.
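Tweedie's formula can be checked by hand in a case where everything is available in closed form. The sketch below assumes 1D Gaussian data \(x \sim \mathcal{N}(\mu, s^2)\), so that \(\tilde{x} \sim \mathcal{N}(\mu, s^2 + \sigma^2)\); the function names are illustrative, not from any library.

```python
def score_noisy(x_tilde, mu, s2, sigma2):
    """Score of the corrupted density p(x_tilde) = N(mu, s2 + sigma2)."""
    return -(x_tilde - mu) / (s2 + sigma2)

def tweedie_denoise(x_tilde, mu, s2, sigma2):
    """E[x | x_tilde] via Tweedie: x_tilde + sigma^2 * score(x_tilde)."""
    return x_tilde + sigma2 * score_noisy(x_tilde, mu, s2, sigma2)

def posterior_mean(x_tilde, mu, s2, sigma2):
    """E[x | x_tilde] via standard Gaussian conditioning."""
    weight = s2 / (s2 + sigma2)
    return mu + weight * (x_tilde - mu)

# Both routes should give the same denoised value for any input.
for x_tilde in (-2.0, 0.0, 1.5, 4.0):
    lhs = tweedie_denoise(x_tilde, mu=1.0, s2=2.0, sigma2=0.25)
    rhs = posterior_mean(x_tilde, mu=1.0, s2=2.0, sigma2=0.25)
    print(x_tilde, lhs, rhs)
```

The two expressions agree because \(\tilde{x} - \sigma^2 (\tilde{x} - \mu)/(s^2 + \sigma^2) = \mu + \frac{s^2}{s^2 + \sigma^2}(\tilde{x} - \mu)\).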

Note

Connection to MCMC: Reverse-Time Fixed Points

Recall from MCMC (Part 1): we construct Markov operators whose stationary distribution is the target \(\pi(x)\). The forward diffusion process has a similar structure:

Forward process: \(q(x_t \mid x_{t-1})\) gradually adds noise, with stationary distribution \(\mathcal{N}(0, I)\)
Reverse process: \(p_\theta(x_{t-1} \mid x_t)\) removes noise; it is time-inhomogeneous, so rather than having a stationary distribution it is trained so that its final marginal is \(p_{\text{data}}(x_0)\)

The key difference: MCMC uses detailed balance to maintain equilibrium at a single distribution, while diffusion uses time-reversed dynamics to construct a path between two distributions (\(\mathcal{N}(0,I) \to p_{\text{data}}\)).

Score matching is the reverse-process analog of acceptance ratios in Metropolis-Hastings: it ensures the reverse dynamics correctly invert the forward process by learning \(\nabla \log p_t(x_t)\) at each timestep.

Both MCMC and diffusion are fundamentally about designing stochastic dynamics with desired equilibrium properties. MCMC: single fixed point. Diffusion: trajectory between fixed points.

15.3 Manifold Learning with DAE

The manifold hypothesis: Natural data (images, text) lies on a low-dimensional manifold embedded in high-dimensional space. A denoising autoencoder learns this manifold structure:

Corruption: Adding noise pushes data points off the manifold
Denoising: The DAE learns to project corrupted points back onto the data manifold
Learned geometry: By training on multiple corruption levels, the DAE implicitly discovers the manifold’s local tangent space structure

Denoising autoencoder recovering manifold structure from corrupted inputs. The model learns to project noisy data back onto the underlying data manifold. Source: Vincent et al., “Extracting and Composing Robust Features with Denoising Autoencoders,” ICML 2008.

Why this matters for diffusion: Multi-scale manifold learning (applying denoising at different noise levels) is why diffusion produces such high-quality samples: it learns the data geometry at every scale, from global structure to fine details.

15.4 Why Plain DAE Fails

The small-noise limitation:

Small noise (\(\sigma\) small): Only learns local geometry → cannot sample globally
Large noise (\(\sigma\) large): Denoising becomes ill-posed → one network cannot map pure noise \(\to\) data

This is the fundamental problem diffusion solves.

16 VAE: The Global Prior Perspective

16.1 VAE’s Contrasting Approach

VAE assumes a global latent code with iid prior: \[z \sim \mathcal{N}(0, I)\]

And learns: \[\begin{align} q_\phi(z|x) & \quad \text{(encoder)} \\ p_\theta(x|z) & \quad \text{(decoder)} \end{align}\]

16.2 VAE Training: The ELBO

The problem: We want to maximize \(\log p(x)\), but computing \(p(x) = \int p_\theta(x|z)p(z)dz\) is intractable.

The solution: Use variational inference with approximate posterior \(q_\phi(z|x)\). Starting from: \[\log p(x) = \log \int p_\theta(x|z)p(z)dz\]

Multiply and divide by \(q_\phi(z|x)\): \[\log p(x) = \log \mathbb{E}_{q_\phi(z|x)}\left[\frac{p_\theta(x|z)p(z)}{q_\phi(z|x)}\right]\]

By Jensen’s inequality (\(\log \mathbb{E}[X] \ge \mathbb{E}[\log X]\)): \[\log p(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x|z)p(z)}{q_\phi(z|x)}\right]\]

Rearranging: \[\log p(x) \ge \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \text{KL}(q_\phi(z|x) \| p(z))\]

This is the Evidence Lower Bound (ELBO). It consists of:

Reconstruction term: \(\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]\) – how well decoder reconstructs \(x\) from \(z\)
Regularization term: \(\text{KL}(q_\phi(z|x) \| p(z))\) – keeps encoder’s posterior close to prior

Training: Maximize ELBO jointly over \(\theta\) (decoder) and \(\phi\) (encoder) using reparameterization trick.
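For the usual choice of a diagonal Gaussian encoder, both ingredients of this training step have simple forms. The sketch below assumes \(q_\phi(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))\) parameterized by a mean and a log-variance (a common convention, not something the text above fixes); the function names are hypothetical.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives in eps, so z is a deterministic, differentiable
    function of mu and log_var; that is the reparameterization trick.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(0)
z = reparameterize([0.5, -1.0], [0.0, -2.0], rng)
print(z, kl_to_standard_normal([0.5, -1.0], [0.0, -2.0]))
```

The KL term is zero exactly when the encoder outputs the prior (\(\mu = 0\), \(\log \sigma^2 = 0\)), which is the regularization pressure described above.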

16.3 VAE’s Limitations

+ Reaches iid noise easily
+ Enables global sampling
- Requires compression bottleneck
- Blurry reconstructions (due to single-step decode)

17 Diffusion Models: The Synthesis

17.1 The Key Insight

Note

Core Idea:

What if we keep DAE’s denoising principle, but make noise large by decomposing it into infinitesimal steps – and wrap it in a variational model?

This is exactly diffusion.

17.2 Forward Process: Gradual Noise Addition

Fixed corruption schedule: \[x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1-\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0,I)\]

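where \(\alpha_t \in [0,1]\) decreases with \(t\). Here \(\alpha_t\) is the cumulative signal coefficient, written \(\bar{\alpha}_t\) in Ho et al. (2020).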

As \(t \to T\): \[x_T \sim \mathcal{N}(0, I)\]

This solves the "large noise" problem by gradual destruction.

Example

Diffusion schedule example (1000 steps):

\(t=0\): \(\alpha_0 = 1\) → \(x_0\) = original image
\(t=250\): \(\alpha_{250} = 0.85\) → light noise
\(t=500\): \(\alpha_{500} = 0.5\) → heavy noise
\(t=1000\): \(\alpha_{1000} \approx 0\) → pure Gaussian noise
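The example values above are close to what a plain cosine schedule \(\alpha_t = \cos^2\!\big(\tfrac{\pi}{2} \cdot \tfrac{t}{T}\big)\) gives. That choice is an assumption made here for illustration; the text does not say which schedule produced the numbers. The sketch also applies the forward corruption formula directly, with `alpha` and `q_sample` as hypothetical names.

```python
import math
import random

T = 1000

def alpha(t, total_steps=T):
    """Cumulative signal coefficient under an assumed plain cosine schedule."""
    return math.cos(0.5 * math.pi * t / total_steps) ** 2

def q_sample(x0, t, rng):
    """Draw x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps, eps ~ N(0, I)."""
    a = alpha(t)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for x in x0]

rng = random.Random(0)
x0 = [1.0, -2.0]
for t in (0, 250, 500, 1000):
    print(t, round(alpha(t), 4), q_sample(x0, t, rng))
```

Because the two coefficients satisfy \(\alpha_t + (1 - \alpha_t) = 1\), unit-variance data keeps unit variance at every step, which is why the endpoint can be \(\mathcal{N}(0, I)\).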

17.3 Reverse Process: Multi-Scale Denoising

Learn to reverse each step: \[p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))\]

Each step is a small-noise DAE:

Easy to learn (well-conditioned)
Local denoising
Indexed by noise level \(t\)

Key insight: Diffusion is a continuum of DAEs indexed by noise level.

17.4 Training Objective: The Variational Connection

Diffusion training optimizes an ELBO: \[\log p_\theta(x_0) \ge \mathbb{E}_q[\log p_\theta(x_0|x_1)] - \sum_{t=2}^T \mathbb{E}_q\big[\text{KL}(q(x_{t-1}|x_t, x_0) \,\|\, p_\theta(x_{t-1}|x_t))\big] - \text{KL}(q(x_T|x_0) \,\|\, p(x_T))\]

Interpretation:

Latents: \(x_1, x_2, \ldots, x_T\) (entire noise trajectory)
Prior: \(p(x_T) = \mathcal{N}(0,I)\)
Encoder: Fixed (forward diffusion)
Decoder: Learned (denoising steps)

Note

Diffusion = Hierarchical VAE

This is not metaphorical – it is exact. Diffusion is a VAE where:

The latent space is the entire noise trajectory
The encoder is fixed (forward diffusion process)
The decoder is learned (reverse denoising process)

17.5 Simplified Training Loss (Ho et al., 2020)

In practice, the ELBO simplifies to: \[\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\big]\]

where \(\epsilon_\theta\) is a neural network predicting the noise \(\epsilon\) from noisy input \(x_t\).
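A short sketch of where this simplification comes from, written in the notation of Ho et al. (2020), which is an assumption about notation: \(\beta_t\) is the per-step noise variance, \(\alpha_t = 1 - \beta_t\) is the per-step factor, and \(\bar{\alpha}_t = \prod_{s \le t} \alpha_s\) is the cumulative coefficient that the forward process in this chapter writes as \(\alpha_t\). Each reverse step is trained to match the forward posterior, which is Gaussian:

\[q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}_t,\ \tilde{\beta}_t I\big), \qquad \tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\Big)\]

If the learned mean is parameterized in the same form, with the true noise replaced by the network's prediction,

\[\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\Big)\]

and the reverse covariance is fixed to \(\sigma_t^2 I\), then the KL divergence between the two Gaussians reduces to a weighted noise-prediction error:

\[\text{KL}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) = \frac{1}{2\sigma_t^2}\|\tilde{\mu}_t - \mu_\theta\|^2 + C = \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\,\|\epsilon - \epsilon_\theta(x_t, t)\|^2 + C\]

Dropping the timestep-dependent weight in front of the squared error gives \(\mathcal{L}_{\text{simple}}\).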

Training procedure:

Sample \(x_0 \sim p_{\text{data}}\), \(t \sim \text{Uniform}\{1, \ldots, T\}\), \(\epsilon \sim \mathcal{N}(0,I)\)
Compute \(x_t = \sqrt{\alpha_t} x_0 + \sqrt{1-\alpha_t} \epsilon\)
Train \(\epsilon_\theta\) to predict \(\epsilon\) from \(x_t\) and \(t\)

Example

PyTorch training loop (simplified):

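import torch
import torch.nn.functional as F

def train_step(x0, model, t_max=1000):
    # One timestep per example, uniform over {1, ..., t_max} as in the procedure above
    t = torch.randint(1, t_max + 1, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    # get_alpha: schedule lookup defined elsewhere, returns alpha_t with shape (batch,)
    alpha_t = get_alpha(t).view(-1, *([1] * (x0.dim() - 1)))  # broadcast over data dims
    x_t = torch.sqrt(alpha_t) * x0 + torch.sqrt(1 - alpha_t) * noise
    pred_noise = model(x_t, t)  # network predicts the injected noise
    loss = F.mse_loss(pred_noise, noise)
    return loss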

17.6 Sampling: Iterative Denoising

Generation procedure:

Start from pure noise: \(x_T \sim \mathcal{N}(0, I)\)
For \(t = T, T-1, \ldots, 1\):
- Predict noise: \(\epsilon_\theta(x_t, t)\)
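- Denoise one step: \(x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(x_t, t)\big) + \sigma_t z\), with \(z \sim \mathcal{N}(0, I)\) for \(t > 1\) and \(z = 0\) at \(t = 1\). This step uses the notation of Ho et al. (2020): \(\bar{\alpha}_t\) is the cumulative coefficient (the \(\alpha_t\) of the forward process above) and \(\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t-1}\) is the per-step factor.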
Return \(x_0\) (generated sample)
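The loop above can be run end to end on a toy problem where no training is needed. The sketch below makes several assumptions that are not from the text: the data are 1D Gaussian \(\mathcal{N}(m, s^2)\), the trained network is replaced by the exact conditional mean \(\mathbb{E}[\epsilon \mid x_t]\) (available in closed form for Gaussian data), the schedule is a linear \(\beta_t\) ramp over 200 steps, and \(\sigma_t^2 = \beta_t\). It uses the Ho et al. (2020) notation, with `alphas` the per-step factors and `alpha_bars` their cumulative products.

```python
import math
import random

def make_schedule(num_steps, beta_start=5e-4, beta_end=0.1):
    """Linear beta ramp (assumed values); returns betas, alphas, alpha_bars."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars

def exact_eps(x_t, alpha_bar, m, s):
    """E[eps | x_t] for x0 ~ N(m, s^2); stands in for a trained eps_theta."""
    var_t = alpha_bar * s * s + 1.0 - alpha_bar
    return math.sqrt(1.0 - alpha_bar) * (x_t - math.sqrt(alpha_bar) * m) / var_t

def sample(schedule, m, s, rng):
    betas, alphas, alpha_bars = schedule
    x = rng.gauss(0.0, 1.0)                      # start from pure noise
    for i in reversed(range(len(betas))):        # list index i is timestep t = i + 1
        eps = exact_eps(x, alpha_bars[i], m, s)
        mean = (x - betas[i] / math.sqrt(1.0 - alpha_bars[i]) * eps) / math.sqrt(alphas[i])
        z = rng.gauss(0.0, 1.0) if i > 0 else 0.0   # no noise on the final step
        x = mean + math.sqrt(betas[i]) * z          # sigma_t^2 = beta_t (assumed)
    return x

rng = random.Random(0)
schedule = make_schedule(200)
samples = [sample(schedule, 2.0, 0.5, rng) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)
sample_std = math.sqrt(sum((v - sample_mean) ** 2 for v in samples) / len(samples))
print(sample_mean, sample_std)   # should land near m = 2.0 and s = 0.5
```

With a learned \(\epsilon_\theta\) in place of `exact_eps`, the loop body is unchanged; only the source of the noise estimate differs.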

18 Architectures in Practice

18.1 U-Net: The Standard Backbone

Architecture:

Encoder-decoder with skip connections
Time embedding \(t\) injected at each block
Self-attention at lower resolutions
Residual blocks with GroupNorm
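One common way to turn the scalar timestep into a vector that can be injected into each block is a sinusoidal embedding, as used for Transformer positions. The text above does not specify the embedding, so the sketch below is an assumption, and `timestep_embedding` with its `max_period` parameter is an illustrative name and default rather than a fixed API.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep t into a vector of even length dim.

    Frequencies are geometrically spaced from 1 down towards 1 / max_period;
    the first half of the output holds sines and the second half cosines.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = timestep_embedding(t=250, dim=8)
print(len(emb), emb[:2])
```

In a U-Net this vector would typically pass through a small MLP and then be added to, or used to scale, the features inside each residual block.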

Why U-Net?

Skip connections preserve high-frequency details
Multi-scale processing matches multi-scale noise
Efficient for images

18.2 Transformer-Based: Diffusion Transformers (DiT)

DiT (Peebles & Xie, 2023):

Replace U-Net with Vision Transformer
Patch-based input (like ViT)
Conditional via adaptive layer norm (adaLN)
Scales better than U-Net (compute vs quality)
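The adaLN conditioning mentioned above can be sketched for a single token vector. This is a minimal illustration, assuming the modulation form \(\text{LN}(x) \cdot (1 + \text{scale}) + \text{shift}\), with `shift` and `scale` supplied directly; in a real model they would be produced from the conditioning embedding (timestep, class or text) by a small network. Function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalise one feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_modulate(x, shift, scale):
    """Adaptive layer norm: the condition sets the per-feature scale and shift."""
    return [h * (1.0 + s) + b for h, s, b in zip(layer_norm(x), scale, shift)]

x = [1.0, 2.0, 3.0]
plain = adaln_modulate(x, shift=[0.0] * 3, scale=[0.0] * 3)      # reduces to layer norm
modulated = adaln_modulate(x, shift=[0.5] * 3, scale=[1.0] * 3)  # doubled, then shifted
print(plain, modulated)
```

With zero shift and scale the block behaves as an ordinary layer norm, so the conditioning acts as a learned perturbation around the unconditional computation.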

When to use DiT vs U-Net:

U-Net: Images, faster inference, lower compute
DiT: High resolution, large-scale training, better scaling laws

19 Practical Models and Applications

19.1 Text-to-Image: Stable Diffusion

Key innovation: Latent diffusion (Rombach et al., 2022)

Architecture:

VAE encoder: \(z = \mathcal{E}(x)\) (compress \(512 \times 512\) to \(64 \times 64\))
Diffusion in latent space: \(z_t = \sqrt{\alpha_t} z + \sqrt{1-\alpha_t} \epsilon\)
U-Net with cross-attention to CLIP text embeddings
VAE decoder: \(\hat{x} = \mathcal{D}(z_0)\)

Why latent space?

+ \(8 \times\) spatial downsampling per side (\(64 \times\) fewer spatial positions), so far less compute than pixel space
+ Faster sampling (50 steps \(\sim\)5 seconds on GPU)
+ Still high fidelity
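The size of the saving follows from simple counting. The sketch below assumes an RGB input (3 channels) and a 4-channel latent, which matches the commonly described Stable Diffusion configuration but is not stated in the text above; `num_elements` is an illustrative helper.

```python
def num_elements(height, width, channels):
    """Number of scalar values in an image-like tensor."""
    return height * width * channels

pixel_elems = num_elements(512, 512, 3)    # RGB image (assumed 3 channels)
latent_elems = num_elements(64, 64, 4)     # latent (assumed 4 channels)

side_ratio = 512 // 64                           # downsampling factor per side
position_ratio = (512 * 512) // (64 * 64)        # ratio of spatial positions
element_ratio = pixel_elems // latent_elems      # ratio of total values

print(side_ratio, position_ratio, element_ratio)
```

Under these assumptions the denoiser processes 48 times fewer values per step; the actual speed-up depends on the architecture, since attention cost grows faster than linearly in the number of positions.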

19.2 Other Production Models

Model	Organization	Key Feature
DALL-E 2	OpenAI	CLIP guidance, unCLIP decoder
Imagen	Google	Cascaded diffusion, T5 text encoder
Midjourney	Midjourney	Proprietary, aesthetic fine-tuning
Stable Diffusion	Stability AI	Open source, latent diffusion
Stable Diffusion 3	Stability AI	Rectified flow, improved text rendering

19.3 Where Diffusion Excels

Visual generation (images, video):

Photo-realistic images (beats GANs on fidelity)
Fine detail preservation (high-frequency content)
Controllability (text, sketch, depth, pose)

Audio (speech, music):

WaveGrad (speech synthesis)
Diff-SVC (singing voice conversion)
Riffusion (music generation from text)

3D and motion:

DreamFusion (text-to-3D via score distillation)
MDM (human motion diffusion)

When NOT to use diffusion:

Real-time applications (slow sampling)
Low-dimensional data (overkill)
Discrete data (text) – autoregressive models better

20 Key Developments Timeline

2008–2011: Denoising Autoencoders (Vincent et al.) – score matching connection
2015: Diffusion Probabilistic Models (Sohl-Dickstein et al.) – forward/reverse process formulation
2019: Score-Based Generative Models (Song & Ermon) – Langevin dynamics, annealed noise
2020: DDPM (Ho et al.) – simplified loss, practical training
2021: Score SDEs (Song et al.) – continuous-time limit, probability flow ODE
2021: Guided Diffusion (Dhariwal & Nichol) – classifier guidance, beats GANs on ImageNet
2022: Stable Diffusion (Rombach et al.) – latent diffusion, open source explosion
2022: DALL-E 2, Imagen – production text-to-image systems
2023: Consistency Models (Song et al.) – single-step generation
2022–2024: Rectified Flow/Flow Matching – deterministic trajectories, faster sampling (proposed in 2022, adopted by Stable Diffusion 3 in 2024)

21 Interview Cheat Phrases

“Diffusion models are VAEs whose decoder is a continuous family of denoising autoencoders across noise scales.”
“DAEs learn local geometry via score matching, VAEs assume global iid priors – diffusion bridges them with gradual noise addition.”
“Training is simple: corrupt data to random timestep, predict noise, MSE loss. Sampling is iterative denoising from pure noise.”
“Latent diffusion (Stable Diffusion) runs diffusion in VAE latent space – 8× smaller per side, far cheaper than pixel space, comparable quality.”
“Diffusion beats GANs on fidelity but is slower at inference – typical trade-off is 50 steps vs 1 forward pass.”
“U-Net is standard for images (skip connections, multi-scale). DiT (transformer) scales better for large models.”
“Key papers: DDPM (2020) for practical training, Score SDEs (2021) for theory, Stable Diffusion (2022) for production.”

Note

Common Interview Question: How does diffusion differ from VAE?

Both are likelihood-based generative models with variational objectives. VAE learns a single compressed latent \(z\) and decodes in one shot – fast but blurry. Diffusion learns a sequence of latents \(x_1, \ldots, x_T\) representing noise levels, with a fixed encoder (forward diffusion) and learned decoder (reverse denoising). This hierarchical structure enables higher fidelity at the cost of iterative sampling.

22 Practical Considerations

22.1 Sampling Speed

Problem: Standard DDPM requires 1000 steps → slow

Solutions:

DDIM (Song et al., 2021): Deterministic sampling, skip steps (50 steps often sufficient)
Consistency models: Direct mapping \(x_T \to x_0\) (single step)
Distillation: Train student to mimic teacher in fewer steps
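The step-skipping idea behind DDIM can be sketched with its deterministic update (the \(\eta = 0\) case): predict \(x_0\) from the current noise estimate, then jump straight to the noise level of any earlier timestep. The example below is a toy under stated assumptions: a linear \(\beta\) schedule with commonly used endpoints, data concentrated at a single point `m` so that the exact noise predictor is known, and 50 evenly spaced steps. All function names are illustrative.

```python
import math

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t); index 0 holds alpha_bar_0 = 1."""
    bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        bars.append(bars[-1] * (1.0 - beta))
    return bars

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update from noise level t to an earlier level."""
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps

def ddim_sample(x_start, eps_model, timesteps, alpha_bars):
    x = x_start
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        x = ddim_step(x, eps_model(x, t), alpha_bars[t], alpha_bars[t_prev])
    return x

alpha_bars = make_alpha_bars()
m = 1.5   # toy data: every sample equals m, so the exact noise predictor is known

def exact_eps(x_t, t):
    return (x_t - math.sqrt(alpha_bars[t]) * m) / math.sqrt(1.0 - alpha_bars[t])

timesteps = list(range(1000, -1, -20))   # 1000, 980, ..., 0: 50 steps instead of 1000
x0 = ddim_sample(0.7, exact_eps, timesteps, alpha_bars)
print(x0)
```

Because the update contains no fresh noise, the same starting point always maps to the same output, which is what makes skipping steps possible.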

22.2 Conditioning and Control

Text conditioning:

Cross-attention to text embeddings (CLIP, T5)
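Classifier-free guidance: \(\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \emptyset) + w \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \emptyset))\), where \(\hat{\epsilon}_\theta\) is the guided prediction used for sampling and \(w\) is the guidance scale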
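At sampling time, classifier-free guidance is just a linear combination of two noise predictions from the same network, one with the condition and one with an empty condition. A minimal sketch, with plain lists standing in for tensors and `cfg_combine` as a hypothetical name:

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Guided noise estimate: move from the unconditional prediction
    towards (and, for w > 1, past) the conditional one."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # prediction with the condition dropped
eps_cond = [1.0, 3.0]     # prediction with the condition present

print(cfg_combine(eps_uncond, eps_cond, w=0.0))   # [0.0, 1.0]  unconditional
print(cfg_combine(eps_uncond, eps_cond, w=1.0))   # [1.0, 3.0]  plain conditional
print(cfg_combine(eps_uncond, eps_cond, w=2.0))   # [2.0, 5.0]  extrapolated
```

Values of \(w\) above 1 push samples further toward the condition, usually trading some diversity for closer adherence to the prompt.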

Structural conditioning:

ControlNet: Add spatial conditioning (edges, depth, pose)
Inpainting: Condition on a mask and the known pixels; only the masked region is regenerated

22.3 Evaluation Metrics

FID (Fréchet Inception Distance): Measures distribution match to real data (lower is better)
IS (Inception Score): Measures diversity and quality (higher is better)
CLIP Score: Text-image alignment for conditional models
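FID is the Fréchet distance between two Gaussians fitted to feature statistics of real and generated images: \(\|\mu_1 - \mu_2\|^2 + \text{Tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\big)\). The sketch below simplifies this to diagonal covariances, an assumption made to keep it dependency-free; the real metric uses full covariance matrices of Inception features. The function name is illustrative.

```python
import math

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances.

    For diagonal covariances the trace term reduces per dimension to
    v1 + v2 - 2 * sqrt(v1 * v2), i.e. (sqrt(v1) - sqrt(v2)) ** 2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2)
                   for v1, v2 in zip(var1, var2))
    return mean_term + cov_term

same = frechet_distance_diag([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
apart = frechet_distance_diag([0.0], [1.0], [3.0], [4.0])
print(same, apart)   # 0.0 and 10.0
```

The distance is zero only when both the means and the variances match, which is why FID penalizes both poor quality and missing diversity.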

For questions, corrections, or suggestions: peymanr@gmail.com