Chapter 11: Key Architectures (ResNet, BERT, GPT, Qwen)
13 Overview
This chapter covers production architectures in vision and NLP: ResNet, BERT, GPT, LLaMA, and modern variants. We examine why design choices matter: not just the specs, but the reasoning behind them, distilled from billions of GPU-hours of experiments.
Key questions: Why bottleneck blocks in ResNet? Why pre-norm in GPT-2? Why RMSNorm in LLaMA? We trace evolution from GPT-2 (1.5B) to LLaMA-3 (405B), BERT’s 512 tokens to Gemini’s 10M, and production trade-offs: 7B vs 70B, decoder-only vs encoder-decoder, MoE vs dense.
Cross-Reference Guide:
RoPE mechanism: See attention.tex §3.2
GQA vs MHA vs MQA: See attention.tex §3.3
MLA (Multi-head Latent Attention): See attention.tex (DeepSeek section)
Sliding window attention: See attention.tex §4
MoE routing & load balancing: See attention.tex §8
Flash Attention: See attention.tex §4.3
Distributed training (ZeRO, FSDP): See training_optimization.tex
13.1 Architecture Taxonomy
Three paradigms:
Encoder-only (BERT, RoBERTa): Bidirectional attention → rich understanding, no generation. Best for classification, NER, embeddings.
Decoder-only (GPT, LLaMA, Mistral): Causal attention (\(i\) sees \(\leq i\)) → efficient KV caching, dominant 2023-2024. Why: unified pre-training, efficient generation, predictable scaling, instruction-tuning flexibility.
Encoder-decoder (T5, BART): Bidirectional encoder + autoregressive decoder. Natural for translation/summarization but more complex (dual stacks, no prefix caching).
14 Vision Architectures
14.1 ResNet (Deep Residual Networks)
Before ResNet, very deep networks suffered from a degradation problem: adding layers made both training and test error worse, an optimization failure not explained by overfitting alone (and aggravated by vanishing/exploding gradients). ResNet (He et al., 2015) solved this with residual connections.
The Key Insight: Instead of learning the desired output \(H(x)\) directly, learn the residual \(F(x) = H(x) - x\) and add it to the input: \[y = x + F(x), \quad \text{where } F(x) \text{ is learned by the layer's weights}\]
Important: The network directly learns \(F(x)\); we never compute \(H(x)\) explicitly. The notation \(F(x) = H(x) - x\) just expresses the relationship: if the desired output is \(H(x)\), the residual branch must produce \(F(x) = H(x) - x\).
Why this helps: If the optimal mapping is close to identity (\(H(x) \approx x\)), it is easier to learn \(F(x) \approx 0\) (push weights toward zero) than to learn \(H(x) = x\) from scratch (which requires precise weight tuning to reproduce the input). Skip connections also create direct gradient paths: gradients flow backward through the identity branch without attenuation.
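The identity-shortcut argument can be checked numerically. The toy block below (pure Python, illustrative shapes and a two-layer MLP as the residual branch, not the actual conv block) shows that all-zero residual weights give exactly the identity mapping, which is why "learn \(F(x) \approx 0\)" is the easy case:

```python
# Toy residual block: y = x + F(x), with F a tiny two-layer MLP.
# Shapes and weights are illustrative, not from the ResNet paper.

def linear(W, x):
    # W is a list of rows; returns W @ x
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def residual_block(x, W1, W2):
    h = [max(0.0, v) for v in linear(W1, x)]       # ReLU(W1 x)
    f = linear(W2, h)                              # F(x) = W2 ReLU(W1 x)
    return [x_i + f_i for x_i, f_i in zip(x, f)]   # y = x + F(x)

x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
# With all-zero weights, F(x) = 0 and the block is exactly the identity:
print(residual_block(x, zeros, zeros))  # [1.0, -2.0, 3.0]
```

A plain (non-residual) block with zero weights would instead output all zeros, destroying the signal; the shortcut makes "do nothing" the default behavior.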
Block types: Basic blocks (ResNet-18/34): two \(3 \times 3\) convs. Bottleneck blocks (ResNet-50/101/152): \(1\times1 \rightarrow 3\times3 \rightarrow 1\times1\) (compress → process → expand), using far fewer FLOPs than the same depth of full-width \(3\times3\) convs.
Bottleneck Block (ResNet-50/101/152):
Key Details:
Compress: \(1\times1\) conv reduces channels (256 → 64) to save computation
Process: \(3\times3\) conv operates on the reduced dimension (at 1/4 the channel width, this conv costs roughly \(16\times\) less, since conv cost scales with \(C_{in} \cdot C_{out}\))
Expand: \(1\times1\) conv restores channels (64 → 256) to match skip connection
BatchNorm + ReLU: After each conv (except the last; that ReLU comes after the addition)
Skip path: Identity when dimensions match; \(1\times1\) projection when changing channels/spatial size
Final ReLU: Applied after adding residual, not before
ResNet-50: Stem (\(7\times7\) conv/2 + maxpool) → 4 stages of bottlenecks (256 → 512 → 1024 → 2048 channels) → global avg pool → FC. ResNet variants (ResNeXt, EfficientNet) modify blocks but keep residual principle.
ResNet-50 macro-architecture:
14.2 MobileNet: Efficient CNNs for Mobile/Edge
MobileNet (Howard et al., 2017) achieves mobile-friendly efficiency via depthwise separable convolutions.
Standard Convolution:
Input: \(D_F \times D_F \times M\) (spatial \(D_F\), \(M\) input channels)
Kernel: \(D_K \times D_K \times M \times N\) (\(N\) output channels)
Cost: \(D_K \times D_K \times M \times N \times D_F \times D_F\) MACs (multiply-accumulates)
Depthwise Separable = Depthwise + Pointwise:
Depthwise conv: \(3\times3\) conv per channel (no mixing channels)
Kernel: \(D_K \times D_K \times 1\) per channel (total \(M\) kernels)
Cost: \(D_K \times D_K \times M \times D_F \times D_F\) MACs
Pointwise conv: \(1\times1\) conv to mix channels
Kernel: \(1 \times 1 \times M \times N\)
Cost: \(M \times N \times D_F \times D_F\) MACs
Reduction Factor: \[\frac{\text{Depthwise separable cost}}{\text{Standard conv cost}} = \frac{D_K^2 \cdot M \cdot D_F^2 + M \cdot N \cdot D_F^2}{D_K^2 \cdot M \cdot N \cdot D_F^2} = \frac{1}{N} + \frac{1}{D_K^2}\] For \(3\times3\) convs (\(D_K=3\)) and many channels (\(N \gg 1\)): reduction \(\approx \frac{1}{9} + \epsilon\) → **8-9× fewer MACs**.
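The reduction factor can be verified numerically. The sizes below (\(D_F=56\), \(M=64\), \(N=128\)) are illustrative, not from the MobileNet paper:

```python
# MAC counts for standard vs depthwise-separable convolution, using
# the cost formulas above. Example sizes are illustrative.

def standard_macs(DK, M, N, DF):
    return DK * DK * M * N * DF * DF

def separable_macs(DK, M, N, DF):
    depthwise = DK * DK * M * DF * DF   # one DKxDK filter per channel
    pointwise = M * N * DF * DF         # 1x1 conv mixes channels
    return depthwise + pointwise

DK, M, N, DF = 3, 64, 128, 56
ratio = separable_macs(DK, M, N, DF) / standard_macs(DK, M, N, DF)
print(round(ratio, 4))                  # 0.1189
print(round(1 / N + 1 / DK**2, 4))      # 0.1189 -- matches 1/N + 1/DK^2
```

The spatial size \(D_F\) cancels in the ratio, so the \(\approx 8.4\times\) saving here holds at any resolution.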
MobileNetV2/V3 Innovations:
Inverted residuals (V2): Expand channels in bottleneck (\(64 \to 384 \to 64\)) instead of compress
Linear bottleneck: Remove ReLU before projection (preserve information in low-dim space)
Squeeze-and-Excitation (V3): Channel attention mechanism (lightweight)
h-swish activation (V3): Hardware-friendly approximation of Swish
14.3 EfficientNet: Compound Scaling
EfficientNet (Tan & Le, 2019) optimizes depth, width, and resolution jointly via compound scaling.
Traditional Scaling (suboptimal):
Depth scaling: More layers (ResNet-50 → ResNet-152)
Width scaling: More channels per layer (ResNet-50 → WideResNet)
Resolution scaling: Larger input images (224×224 → 299×299)
Scaling one dimension hits diminishing returns; EfficientNet scales all three.
Compound Scaling Formula: \[\text{depth: } d = \alpha^\phi, \quad \text{width: } w = \beta^\phi, \quad \text{resolution: } r = \gamma^\phi\] subject to \(\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2\) and \(\alpha \geq 1, \beta \geq 1, \gamma \geq 1\).
Constraint ensures FLOPs grow as \(2^\phi\) (doubling compute per step). Grid search finds \(\alpha=1.2, \beta=1.1, \gamma=1.15\) (EfficientNet-B0 baseline).
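A quick check of the constraint with the published coefficients (FLOPs scale roughly as depth \(\times\) width\(^2\) \(\times\) resolution\(^2\)):

```python
# Compound scaling: verify the ~2x FLOPs-per-step constraint and
# compute the multipliers for phi = 3 (illustrative choice of phi).

alpha, beta, gamma = 1.2, 1.1, 1.15      # published EfficientNet coefficients

flops_growth_per_step = alpha * beta**2 * gamma**2
print(round(flops_growth_per_step, 3))   # 1.92, i.e. ~2x compute per phi step

phi = 3
depth_mult = alpha ** phi
width_mult = beta ** phi
res_mult = gamma ** phi
print(round(depth_mult, 3), round(width_mult, 3), round(res_mult, 3))
```

So each increment of \(\phi\) scales all three dimensions together while roughly doubling compute.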
MBConv Block (EfficientNet building block):
Inverted residual bottleneck (MobileNetV2-style)
Squeeze-and-Excitation (SE) attention
Stochastic depth (drop path regularization)
EfficientNet Family:
B0: 5.3M params, 0.39B FLOPs, baseline found via neural architecture search (NAS)
B1-B7: Scale B0 using compound scaling (\(\phi = 1, 2, \ldots, 7\))
B7: 66M params, 37B FLOPs, 84.3% ImageNet top-1 (SOTA at publication)
EfficientNetV2: Faster training (Fused-MBConv blocks, adaptive regularization)
Why This Matters:
EfficientNet-B1 (7.8M params) matches ResNet-152 (60M params) accuracy with roughly 16× fewer FLOPs
Demonstrates importance of balanced scaling vs just "more layers"
Widely used for mobile/edge vision (object detection, segmentation) before ViTs
15 Encoder-Only Models
15.1 BERT (Bidirectional Encoder Representations from Transformers)
BERT (Devlin et al., 2019) brought bidirectional pre-training to transformers. Unlike GPT’s left-to-right modeling, BERT sees full sequence (past + future) when encoding each token → powerful for understanding, unsuitable for generation.
Architecture: \(N\) encoder blocks (self-attn + FFN), learned position embeddings, post-norm (LayerNorm after attn/FFN). BERT-Base: 12L, 768H, 12 heads, 110M params. BERT-Large: 24L, 1024H, 16 heads, 340M params.
Training: MLM (mask 15% tokens, predict from bidirectional context) + NSP (next sentence prediction). Data: BooksCorpus + Wikipedia (3.3B words), 512 tokens max.
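The MLM corruption can be sketched concretely. This follows the BERT paper's 80/10/10 rule for selected positions (80% → [MASK], 10% → random token, 10% → unchanged); the tokens, vocabulary, and function name are illustrative:

```python
# Sketch of BERT-style MLM corruption: select ~15% of positions; of
# those, 80% -> [MASK], 10% -> random token, 10% -> left unchanged.
import random

def mlm_corrupt(tokens, vocab, mask_rate=0.15, rng=random.Random(0)):
    out, targets = list(tokens), {}
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            targets[i] = tokens[i]          # model must predict the original
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = rng.choice(vocab)  # random replacement
            # else: keep the token unchanged (the 10% identity case)
    return out, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"] * 20
corrupted, targets = mlm_corrupt(tokens, vocab=["dog", "ran", "hat"])
print(len(targets) / len(tokens))  # close to 0.15
```

The identity case is why the pre-train/inference mismatch mentioned below is only partial: the model sometimes must predict an unmasked token too.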
Limitations: 512-token limit too short, bidirectional prevents generation, MLM creates pre-train/fine-tune mismatch ([MASK] seen in training, not inference).
15.2 RoBERTa (Robustly Optimized BERT)
RoBERTa (Liu et al., 2019, Meta): same architecture as BERT-Large, but better training. Remove NSP (harmful), dynamic masking (different masks each epoch), larger batches (8K vs 256), more data (160GB vs 16GB), longer training (500K steps). Result: SOTA on GLUE/SQuAD/RACE without architectural changes.
Lesson: How you train often matters more than architecture. LLaMA-3's gains over LLaMA-2 come largely from training on 7.5× more tokens (15T vs 2T), not architecture.
Interview Insight: BERT is rarely used for generation (no causal masking). Modern practice: Use decoder-only models (LLaMA, Mistral) even for classification via instruction-tuning. BERT still relevant for embeddings (sentence-transformers) and low-latency classification where you need bidirectional understanding without generation.
16 Decoder-Only Models
16.1 GPT (Generative Pre-trained Transformer)
GPT established the decoder-only paradigm dominating modern LLMs. Unlike BERT's bidirectional encoding, GPT uses causal attention: token \(i\) only sees positions \(0\) through \(i\) (itself and everything before it). This enables autoregressive generation but sacrifices bidirectional context.
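The causal constraint is just a lower-triangular mask over attention scores. A minimal sketch (1 = may attend, 0 = blocked):

```python
# Causal attention mask: position i may attend only to positions <= i.

def causal_mask(n):
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

In practice the blocked positions get \(-\infty\) added to their logits before the softmax; the triangular structure is also what makes KV caching valid, since past tokens never attend to future ones.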
GPT-1 (2018): 12 layers, 768 dim, 117M params on BooksCorpus. Key insight: unsupervised pre-training transfers to supervised tasks.
GPT-2 (2019): 1.5B parameters. Crucial change: pre-normalization (LayerNorm before attention/FFN) stabilizes deep network training via cleaner gradient flow. Demonstrated zero-shot learning by framing tasks as text completion.
Decoder-only transformer stack:
GPT-3 (2020): 175B params (96L, 12,288H, 2048 ctx). Demonstrated in-context learning: few-shot prompting without gradient updates. Training: $4.6M on 10K V100s, 300B tokens. Architecture similar to GPT-2 but with sparse attention in some layers. Real innovation: scale produced emergent capabilities such as instruction following, coding, and arithmetic.
GPT-4 (2023): Estimated 1.7T params (MoE, 280B active/token). Multi-modal (text+images), 128K context. Extensive RLHF for alignment.
16.2 LLaMA Family (Meta)
Meta’s LLaMA democratized LLM research with competitive open-weights models on public data. Its architecture became the open-source standard.
LLaMA-1 (2023): 7B/13B/33B/65B sizes. 7B spec: 32L, 4096H, 32 heads (\(d_k=128\)), 2048 ctx (→4096 via RoPE).
Key innovations:
RoPE: \(q_m = R_m q, k_n = R_n k\) → attention depends on \((m-n)\), enables length extrapolation
SwiGLU: \(\text{SwiGLU}(x) = \text{Swish}(xW_1) \odot (xW_2)\) → better quality
RMSNorm: \(\frac{x}{\sqrt{\text{mean}(x^2) + \epsilon}} \cdot \gamma\) → simpler than LayerNorm
No bias terms → better training stability at scale
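RoPE's defining property, that attention scores depend only on the relative offset \(m-n\), can be demonstrated directly. A pure-Python sketch with illustrative 4-dim vectors (real models rotate 128-dim heads):

```python
# RoPE sketch: rotate 2-D pairs of q and k by position-dependent angles;
# the q.k dot product then depends only on the offset m - n.
import math

def rope(vec, pos, base=10000.0):
    out = []
    d = len(vec)
    for i in range(0, d, 2):
        theta = base ** (-i / d)              # per-pair frequency
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]  # 2-D rotation
    return out

def score(q, k, m, n):
    qm, kn = rope(q, m), rope(k, n)
    return sum(a * b for a, b in zip(qm, kn))

q, k = [0.3, -1.2, 0.7, 0.5], [1.1, 0.4, -0.6, 0.9]
# Same offset (m - n = 2) at different absolute positions -> same score:
print(abs(score(q, k, 5, 3) - score(q, k, 40, 38)) < 1e-9)  # True
```

This is exactly the \(q_m = R_m q\), \(k_n = R_n k\) formulation above: \(q_m^\top k_n = q^\top R_{n-m} k\).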
Training: 1.4T tokens (CommonCrawl, C4, GitHub, Wikipedia, arXiv, StackExchange). 65B took 21 days on 2048 A100s.
LLaMA-2 (2023): 2T tokens, 4K context, GQA in the 70B model (groups of query heads share KV projections; 64 Q heads with 8 KV heads gives an 8× KV-cache reduction). Added instruction-tuning + RLHF, competitive with GPT-3.5.
LLaMA-3 (2024): 15T tokens (7.5× more), 128K vocab (from 32K), 8K ctx → 128K via fine-tuning. All sizes use GQA (8 KV heads). 405B flagship: 126L, 16,384H, 128 Q heads, 8 KV heads. Trained on 16K H100s. Multi-stage RLHF + rejection sampling + DPO.
LLaMA Architecture Summary (7B/8B):
| Component | LLaMA-1 (7B) | LLaMA-3 (8B) |
|---|---|---|
| Layers | 32 | 32 |
| Hidden dim | 4096 | 4096 |
| Heads | 32 | 32 |
| KV heads | 32 (MHA) | 8 (GQA) |
| Context | 2048 | 8192 |
| Vocab | 32K | 128K |
| Tokens trained | 1T | 15T |
| FFN dim | 11,008 | 14,336 |
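The MHA → GQA switch in the table above translates directly into KV-cache savings. A quick calculation, using the table's numbers (fp16, 8K sequence; the sequence length is an illustrative choice):

```python
# KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * seq * bytes.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

head_dim = 4096 // 32                                   # 128
mha = kv_cache_bytes(32, 32, head_dim, seq_len=8192)    # LLaMA-1-style MHA
gqa = kv_cache_bytes(32, 8, head_dim, seq_len=8192)     # LLaMA-3 GQA
print(mha / 2**30, gqa / 2**30, mha // gqa)             # 4.0 GiB, 1.0 GiB, 4x
```

At long contexts this cache, not the weights, is often the binding memory constraint, which is why every recent LLaMA generation reduces KV heads.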
16.3 Mistral & Mixtral (Mistral AI)
Mistral 7B (Jiang et al., 2023):
Architecture:
32 layers, 4096 hidden, 32 heads, 7.3B params
Sliding Window Attention (SWA): Each layer only attends to previous 4096 tokens (window size)
GQA: 8 KV heads (4\(\times\) compression)
Context: 8K tokens (via SWA), effective receptive field grows with layers
RoPE, SwiGLU, RMSNorm (same as LLaMA)
Key Innovation: Sliding window allows longer context with \(O(w)\) memory per layer (vs \(O(n)\) for full attention), where \(w=4096\)
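Sliding-window attention is a banded version of the causal mask. A sketch with a tiny window (\(w=3\) here for readability; Mistral uses \(w=4096\)):

```python
# Sliding-window mask: position i attends to positions in (i - w, i].
# Stacking L such layers grows the effective receptive field to ~L * w.

def sliding_window_mask(n, w):
    return [[1 if i - w < j <= i else 0 for j in range(n)] for i in range(n)]

mask = sliding_window_mask(6, 3)
print(mask[5])       # [0, 0, 0, 1, 1, 1] -- only the last w positions visible
print(sum(mask[5]))  # 3 -- KV cache per layer holds at most w entries
```

Because each layer sees its predecessor's window, information from tokens outside the window still propagates upward through the stack, which is how the effective receptive field exceeds \(w\).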
Mixtral 8x7B (Jiang et al., 2024):
Architecture:
Sparse Mixture-of-Experts (MoE)
8 experts per layer (each expert is 7B FFN)
Router selects top-2 experts per token
Total: 46.7B params, only 12.9B active per token
32 layers, 4096 hidden, same attention as Mistral 7B
Context: 32K tokens
MoE Benefits:
Inference cost of 12.9B model, performance of 46.7B model
Outperforms LLaMA-2 70B on most benchmarks
Latency comparable to a \(\sim\)13B dense model (the two selected experts run in parallel)
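The top-2 routing step can be sketched in a few lines. The gate logits and the scalar "experts" below are illustrative stand-ins (real experts are 7B-parameter FFNs, and the gate is a learned linear layer):

```python
# Top-2 MoE routing sketch: softmax over 8 expert logits, keep the two
# largest, renormalize, and mix only those experts' outputs.
import math

def top2_route(logits):
    idx = sorted(range(len(logits)), key=lambda i: logits[i])[-2:]
    exps = {i: math.exp(logits[i]) for i in idx}
    z = sum(exps.values())
    return {i: exps[i] / z for i in idx}   # weights over the 2 chosen experts

# Scalar functions stand in for the 8 expert FFNs:
experts = [lambda x, k=k: (k + 1) * x for k in range(8)]

gate_logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
weights = top2_route(gate_logits)
y = sum(w * experts[i](1.0) for i, w in weights.items())
print(sorted(weights))   # [1, 4] -> only experts 1 and 4 are activated
```

Only 2 of 8 expert FFNs execute per token, which is the whole source of the total-vs-active parameter gap.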
16.4 Qwen (Alibaba Cloud)
Alibaba’s Qwen targets multilingual (esp. Chinese-English-code). Architecture follows LLaMA (RoPE, SwiGLU, RMSNorm, pre-norm) but with massive 151,851-token vocabulary (5× LLaMA’s 32K) to handle Chinese characters + English + code efficiently.
Qwen-1 (2023): 7B uses 32L, 4096H, 32 heads (MHA). 3T tokens with aggressive deduplication/filtering. Multi-stage: pre-train → SFT → RLHF.
Qwen-2 (2024): 29 languages, improved code (more GitHub data), 128K ctx via RoPE scaling.
Qwen-2.5 (2024): 18T tokens (6× more), 0.5B-72B sizes. Specialized: Qwen-2.5-Coder, Qwen-2.5-Math.
16.5 DeepSeek (DeepSeek AI)
DeepSeek-V2 (2024) solves long-context memory via Multi-head Latent Attention (MLA). Problem: at 128K context the KV cache dominates memory. Solution: cache a low-rank latent and reconstruct keys/values from it: \[c_{KV} = W^{\text{down}} x \in \mathbb{R}^{d_c}, \qquad K = W_K^{\text{up}} c_{KV}, \qquad V = W_V^{\text{up}} c_{KV}, \qquad d_c \ll d\] where \(d_c = 512\) versus hidden size \(d = 5120\) → roughly 10× KV cache reduction (only \(c_{KV}\) is stored; K and V are reconstructed at attention time).
Architecture: 236B total, 21B active/token (MoE: 2 shared + 160 routed experts, top-6 routing). 60L, 5120H, 128K ctx. Training: 8.1T tokens (Chinese + English + code).
Impact: MLA makes long-context practical. LLaMA-70B (8K ctx) needs massive clusters; DeepSeek-V2 (128K ctx, 21B active) runs on modest hardware. Trade-off: complexity + slight quality loss from low-rank projection.
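The down/up projection pattern is simple to sketch. Dimensions here are shrunk for illustration (\(d=8\), \(d_c=2\); DeepSeek-V2 uses roughly 5120 and 512), and the weights are random stand-ins:

```python
# MLA sketch: cache a small latent c = W_down x instead of full K/V,
# and reconstruct K = W_up c at attention time.
import random

rng = random.Random(0)
d, d_c = 8, 2
W_down = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d_c)]  # d_c x d
W_up = [[rng.gauss(0, 1) for _ in range(d_c)] for _ in range(d)]    # d x d_c

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

x = [rng.gauss(0, 1) for _ in range(d)]
c = matvec(W_down, x)   # cached: d_c floats per token (4x smaller here)
k = matvec(W_up, c)     # reconstructed key, back in d dims
print(len(c), len(k))   # 2 8
```

The cache shrinks by \(d / d_c\); the price is that reconstructed K/V are constrained to a rank-\(d_c\) subspace, which is the "slight quality loss" trade-off noted above.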
16.6 Phi Models (Microsoft)
Phi-1 (2023): 1.3B params, trained on high-quality code/reasoning data
Phi-2 (2023): 2.7B params, outperforms 7B models on reasoning benchmarks
Phi-3 (2024):
Sizes: 3.8B (mini), 7B, 14B
Context: 128K tokens
Key idea: Small models with carefully curated data (3.3T tokens)
Architecture: Standard decoder-only (similar to LLaMA)
Philosophy: Quality over quantity; smaller models trained on synthetic reasoning data can match larger models
17 Encoder-Decoder Models
17.1 T5 (Text-to-Text Transfer Transformer)
Paper: Raffel et al., 2020 (Google)
Architecture:
Encoder-decoder (original Transformer from Vaswani et al.)
Sizes: 60M, 220M, 770M, 3B, 11B params
11B spec: 24 encoder layers, 24 decoder layers, 1024 hidden, 16 heads
Relative position embeddings (not absolute)
SentencePiece tokenization (32K vocab)
Key Innovation: Unified text-to-text framework, with all tasks framed as "input text → output text"
Translation: "translate English to German: That is good." → "Das ist gut."
Classification: "sentiment: This movie is great" → "positive"
Summarization: "summarize: [long text]" → "[summary]"
Training:
C4 dataset (Colossal Clean Crawled Corpus): 750GB text
Pre-training objective: Span corruption (mask consecutive spans, predict them)
Multi-task fine-tuning on supervised tasks
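Span corruption can be sketched concretely. Span selection is hard-coded below for determinism (T5 samples spans covering ~15% of tokens); the sentinel names follow T5's `<extra_id_N>` convention:

```python
# T5-style span corruption: replace chosen spans with sentinel tokens
# in the input; the target lists each sentinel followed by its span.

def span_corrupt(tokens, spans):
    inp, tgt = [], []
    prev_end = 0
    for s, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{s}>"
        inp += tokens[prev_end:start] + [sentinel]
        tgt += [sentinel] + tokens[start:end]
        prev_end = end
    inp += tokens[prev_end:]
    return inp, tgt

tokens = "thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, spans=[(1, 2), (6, 8)])
print(inp)  # ['thank', '<extra_id_0>', 'for', 'inviting', 'me', 'to',
            #  '<extra_id_1>', 'last', 'week']
print(tgt)  # ['<extra_id_0>', 'you', '<extra_id_1>', 'your', 'party']
```

Unlike BERT's per-token masking, whole spans collapse to a single sentinel, so the decoder must generate variable-length completions.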
Use Cases:
Translation, summarization (better than decoder-only for these)
Question answering
Text classification (via text-to-text)
17.2 BART (Bidirectional and Auto-Regressive Transformers)
Paper: Lewis et al., 2020 (Facebook/Meta)
Architecture:
Encoder-decoder (similar to T5)
BART-Large: 12 encoder layers, 12 decoder layers, 1024 hidden, 406M params
Standard Transformer architecture (absolute position embeddings, GELU)
Pre-training: Denoising autoencoder with various corruption strategies:
Token masking (like BERT)
Token deletion
Text infilling (replace spans with single mask token)
Sentence permutation (shuffle sentences)
Document rotation (rotate document to start from random token)
Differences from T5:
BART uses ReLU/GELU (T5 uses gated activations)
BART has more diverse corruption (T5 uses span corruption only)
BART trained on smaller data (160GB vs 750GB)
Use Cases:
Summarization (CNN/DailyMail state-of-the-art)
Text generation with strong understanding (encoder helps)
Translation (fine-tuned)
Interview Insight: Encoder-decoder models (T5, BART) excel at sequence-to-sequence tasks where input and output are different (translation, summarization). Decoder-only models (GPT, LLaMA) dominate for generation tasks where output continues/responds to input. Modern trend: Even summarization/translation increasingly done with decoder-only via prompting.
18 Architecture Comparison Tables
18.1 Key Architectural Choices
| Model | Pos Enc | Norm | Activation | Attn | Norm Pos |
|---|---|---|---|---|---|
| BERT | Learned | LayerNorm | GELU | Full | Post |
| GPT-2 | Learned | LayerNorm | GELU | Causal | Pre |
| GPT-3 | Learned | LayerNorm | GELU | Causal | Pre |
| T5 | Relative | LayerNorm | ReLU | Full | Pre |
| LLaMA | RoPE | RMSNorm | SwiGLU | Causal | Pre |
| Mistral | RoPE | RMSNorm | SwiGLU | Sliding | Pre |
| Qwen | RoPE | RMSNorm | SwiGLU | Causal | Pre |
| DeepSeek-V2 | RoPE | RMSNorm | SwiGLU | MLA | Pre |
18.2 Model Specifications
| Model | Params | Layers | Hidden | Heads | Context |
|---|---|---|---|---|---|
| BERT-Base | 110M | 12 | 768 | 12 | 512 |
| BERT-Large | 340M | 24 | 1024 | 16 | 512 |
| GPT-2 | 1.5B | 48 | 1600 | 25 | 1024 |
| GPT-3 | 175B | 96 | 12,288 | 96 | 2048 |
| T5-11B | 11B | 24/24 | 1024 | 128 | 512 |
| LLaMA-7B | 7B | 32 | 4096 | 32 | 2048 |
| LLaMA-2-70B | 70B | 80 | 8192 | 64 | 4096 |
| LLaMA-3-8B | 8B | 32 | 4096 | 32 | 8192 |
| LLaMA-3-70B | 70B | 80 | 8192 | 64 | 8192 |
| LLaMA-3-405B | 405B | 126 | 16,384 | 128 | 8192 |
| Mistral-7B | 7.3B | 32 | 4096 | 32 | 8192 |
| Mixtral-8x7B | 46.7B | 32 | 4096 | 32 | 32K |
| Qwen-7B | 7B | 32 | 4096 | 32 | 8192 |
| Qwen-72B | 72B | 80 | 8192 | 64 | 32K |
| DeepSeek-V2 | 236B | 60 | 5120 | 128 | 128K |
| Phi-3-mini | 3.8B | 32 | 3072 | 32 | 128K |
18.3 Training Data Comparison
| Model | Tokens | Key Datasets |
|---|---|---|
| BERT | 3.3B words | BooksCorpus + Wikipedia |
| GPT-2 | 10B tokens | WebText (Reddit links) |
| GPT-3 | 300B tokens | CommonCrawl + Books + Wikipedia |
| LLaMA-1 | 1.4T tokens | CommonCrawl + C4 + GitHub + arXiv |
| LLaMA-2 | 2T tokens | Higher quality CC + code |
| LLaMA-3 | 15T tokens | Curated web + multilingual |
| Qwen-2.5 | 18T tokens | Multilingual + code + math |
| DeepSeek-V2 | 8.1T tokens | Chinese + English + code |
19 When to Use Which Architecture
19.1 Decision Framework
Classification / Embeddings / NER: BERT-like (RoBERTa, DeBERTa) for bidirectional understanding. Modern alternative: instruction-tuned decoder models (LLaMA, Mistral) for flexibility at cost of speed.
Text generation / Chat: Decoder-only (LLaMA-3, Qwen, Mistral). Long context: DeepSeek-V2 (MLA), LLaMA-3 (128K). Efficiency: Mixtral-8x7B (13B active/47B total), Phi-3 (3.8B, strong reasoning).
Translation / Summarization: Traditional: T5/BART (encoder-decoder for distinct input/output). Modern: decoder-only via prompting (simpler). Encoder-decoder still better for very long documents (bidirectional compression).
Code: Specialized models (DeepSeek-Coder, Qwen-2.5-Coder, StarCoder, CodeLlama) excel at syntax/multi-file context. General models (LLaMA-3, Qwen) work reasonably.
Reasoning / Math: Scale helps: GPT-4, LLaMA-3-405B, DeepSeek-V2. Small: Phi-3 (synthetic reasoning data). Specialized: Qwen-2.5-Math.
Production considerations:
Latency requirements: Real-time applications need 7B-13B models; batch processing can use 70B+
Cost optimization: MoE models (Mixtral) activate fewer parameters; 4-bit quantization halves memory
Long context (100K+ tokens): DeepSeek-V2 with MLA compression or streaming attention architectures
Multilingual needs: Qwen (151K vocab) and LLaMA-3 (128K vocab) trained on diverse languages
On-premise deployment: Open-weight models (LLaMA, Mistral, Qwen) vs API-only (GPT-4, Claude)
19.2 Computational Budget Considerations
| Budget | Inference | Fine-Tuning |
|---|---|---|
| Single GPU (24GB) | Phi-3-mini (3.8B) | LoRA on 7B models |
| 2-4 GPUs (A100) | LLaMA-3-8B, Mistral-7B | Full fine-tune 7B |
| 8 GPUs (A100) | LLaMA-3-70B, Mixtral-8x7B | LoRA on 70B |
| 16+ GPUs | LLaMA-3-405B, DeepSeek-V2 | Full fine-tune 70B |
| API only | GPT-4, Claude-3.5 | Few-shot prompting |
Interview Tip: For production systems, consider:
Latency: Smaller models (7B-13B) for real-time applications
Cost: MoE models (Mixtral) or quantization (4-bit) for efficiency
Long context: DeepSeek-V2 (MLA) or streaming attention for 100K+ tokens
Multilingual: Qwen, LLaMA-3 (extensive multilingual training)
19.3 Model Pros & Cons Summary
| Model | Pros | Cons |
|---|---|---|
| BERT | Best for classification/NER; bidirectional | No generation; 512 token limit |
| GPT-3 | Powerful; in-context learning | Closed; expensive API; 2048 context |
| T5/BART | Strong at seq2seq tasks | Slower than decoder-only; encoder+decoder complexity |
| LLaMA-2-7B | Open weights; efficient; good quality | 4K context; weaker than GPT-3.5 |
| LLaMA-3-8B | 8K context; 15T tokens; strong | Requires more GPU memory than LLaMA-2 |
| LLaMA-3-70B | Near GPT-4 quality; open | 140GB memory (fp16); slow inference |
| LLaMA-3-405B | SOTA open model | 810GB memory; requires many GPUs |
| Mistral-7B | Sliding window; 8K context; fast | Smaller training data than LLaMA |
| Mixtral-8x7B | 47B params, 13B active; strong | 94GB memory (all experts); complex deployment |
| Qwen-7B | Excellent Chinese; strong code | Less English training than LLaMA |
| Qwen-72B | Multilingual; 32K context | 144GB memory; fewer users than LLaMA |
| DeepSeek-V2 | MLA = 10\(\times\) KV reduction; 128K | 236B total params; complex architecture |
| Phi-3-mini | 3.8B but strong reasoning | Narrow training data; less general |
20 Practical Considerations
20.1 Context Length Handling
| Technique | Models | Max Context |
|---|---|---|
| Sliding Window | Mistral | Effective \(\infty\) (4K window) |
| RoPE Scaling | LLaMA-2/3 | 128K (from 4K training) |
| MLA | DeepSeek-V2 | 128K (low KV cache) |
| Sparse Attention | Longformer, BigBird | 16K-32K |
| Chunking + RAG | Any model | Arbitrary (retrieve relevant) |
20.2 Quantization Options
16-bit (fp16/bf16): Standard training/inference, no quality loss
8-bit (int8): 2\(\times\) memory reduction, minimal quality loss (LLM.int8())
4-bit (NF4): 4\(\times\) reduction, slight quality loss (QLoRA, GPTQ, AWQ)
3-bit or lower: Noticeable degradation, only for extreme resource constraints
Rule of thumb:
7B model: 14GB (fp16), 7GB (8-bit), 3.5GB (4-bit)
70B model: 140GB (fp16), 70GB (8-bit), 35GB (4-bit)
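The rule of thumb is just params × bytes-per-param. A sketch (weights only; KV cache and activations add more in practice):

```python
# Weight-memory estimate in GB: params(B) * bits / 8.

def weight_gb(params_b, bits):
    return params_b * 1e9 * bits / 8 / 1e9

for params in (7, 70):
    print(params, weight_gb(params, 16), weight_gb(params, 8), weight_gb(params, 4))
# 7  -> 14.0 (fp16), 7.0 (int8), 3.5 (4-bit)
# 70 -> 140.0 (fp16), 70.0 (int8), 35.0 (4-bit)
```

Quantization metadata (scales, zero-points) adds a few percent on top of these numbers in real deployments.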
20.3 Fine-Tuning Strategies
Full fine-tuning: Update all parameters – best quality, high memory/compute
LoRA (Low-Rank Adaptation): Add trainable low-rank matrices \(\Delta W = AB\) where \(A \in \mathbb{R}^{d \times r}\), \(B \in \mathbb{R}^{r \times d}\), \(r \ll d\)
Memory: Only store \(2dr\) params (vs \(d^2\) for full)
Quality: 90-95% of full fine-tuning performance
Common rank: \(r=8\) to \(r=64\)
QLoRA: LoRA + 4-bit quantization – fine-tune 70B on single 48GB GPU
Prefix tuning: Add trainable prompt embeddings (100-1000 tokens)
Prompt tuning: Train soft prompts only (even fewer params than prefix)
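The LoRA memory claim above is simple arithmetic. For a 4096-wide square projection at rank \(r=8\) (dims matching a LLaMA-7B attention projection; the choice of layer is illustrative):

```python
# LoRA parameter count: a rank-r update W + A @ B trains 2*d*r params
# (A is d x r, B is r x d) instead of the full d*d matrix.

d, r = 4096, 8
full = d * d
lora = 2 * d * r
print(full, lora, round(100 * lora / full, 2))  # 16777216 65536 0.39
```

Training ~0.4% of the weights per adapted matrix is why optimizer state (the dominant fine-tuning memory cost) nearly vanishes, and why QLoRA can pair this with 4-bit frozen base weights.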
21 Key Innovations Timeline
2017: Transformer (Vaswani et al.) – attention is all you need
2018: BERT (bidirectional pre-training), GPT-1 (generative pre-training)
2019: GPT-2 (1.5B, zero-shot learning), RoBERTa (better BERT training), T5 (text-to-text)
2020: GPT-3 (175B, in-context learning), BART (denoising encoder-decoder)
2021: Codex (code generation), Switch Transformer (1.6T params MoE)
2022: ChatGPT (GPT-3.5 + RLHF), InstructGPT (alignment via human feedback)
2023: GPT-4 (multi-modal), LLaMA-1/2 (open weights), Mistral (sliding window), Claude-2 (100K context)
2024: LLaMA-3 (405B, 15T tokens), Qwen-2.5 (18T tokens), DeepSeek-V2 (MLA), Mixtral-8x22B, GPT-4o (omni-modal)
22 Future Directions
22.1 Emerging Trends (2024-2025)
Mixture-of-Experts scaling: Sparse models with 1T+ params, 100B active
Long context (1M+ tokens): Gemini 1.5 (10M tokens), improved KV compression
Multi-modal fusion: Vision + language tightly integrated (GPT-4o, Gemini 1.5)
Test-time compute: Models that "think longer" for harder problems (o1, o3)
Efficient architectures: State-space models (Mamba), linear attention variants
Post-training innovation: Better RLHF, DPO, synthetic data for reasoning
22.2 Open Research Questions
How to train 10T+ parameter models efficiently?
Can we get GPT-4 quality with 10B params via better data/algorithms?
Optimal MoE routing strategies (current routers are simple)
How to handle truly long context (10M+ tokens) efficiently?
Better alternatives to transformer attention for long sequences?
23 Interview Questions
Q1: Why did decoder-only models replace encoder-decoder for most tasks?
A:
Unified architecture: Same model for pre-training and all downstream tasks
KV caching: Efficient autoregressive generation (reuse past keys/values)
Scaling laws: Decoder-only performance improves more predictably with scale
Instruction-tuning: Can handle any task via prompting (no separate heads needed)
Q2: What’s the difference between Multi-Head Attention (MHA) and Grouped Query Attention (GQA)?
A:
MHA: Each head has separate K, V projections → \(h\) sets of KV cache
GQA: Groups of query heads share KV projections → fewer KV heads (e.g., 32 Q heads, 8 KV heads)
Benefit: Reduces KV cache by 4\(\times\) (critical for long context), minimal quality loss
Examples: LLaMA-3 (GQA with 8 KV heads), Mistral (GQA)
Q3: Why is RoPE (Rotary Position Embedding) better than learned positional embeddings?
A:
Relative positions: Attention score depends on \((m-n)\) naturally, not absolute positions
Extrapolation: Can handle sequences longer than training length via position interpolation or NTK-aware scaling of the rotation frequencies
No learned params: Deterministic rotation based on position and frequency
Used by: All modern LLMs (LLaMA, Qwen, Mistral, DeepSeek)
Q4: How does Mixture-of-Experts (MoE) improve efficiency?
A:
Sparse activation: Only activate top-K experts per token (e.g., Mixtral uses 2 of 8)
Total params \(\gg\) active params: 46.7B total, 12.9B active (Mixtral-8x7B)
Same latency: Experts run in parallel, no sequential overhead
Trade-off: Larger memory footprint (must load all experts), more complex training
Q5: What’s the difference between SwiGLU and GELU activations?
A:
GELU: Smooth approximation of ReLU, \(\text{GELU}(x) = x \cdot \Phi(x)\) (Gaussian CDF)
SwiGLU: Gated linear unit with Swish, \(\text{SwiGLU}(x) = \text{Swish}(xW_1) \odot (xW_2)\)
Why SwiGLU: Better performance empirically (PaLM paper), standard in modern LLMs (LLaMA, etc.)
Cost: SwiGLU has 2 projections (higher compute), but worth it for quality
Q6: Why does DeepSeek-V2 use Multi-head Latent Attention (MLA)?
A:
Problem: KV cache dominates memory for long contexts (e.g., 128K tokens)
Solution: Cache a low-rank latent \(c = W^{\text{down}} x\) with \(d_c \ll d\) (512 vs 5120), reconstructing \(K = W_K^{\text{up}} c\) and \(V = W_V^{\text{up}} c\) at attention time
Benefit: 10\(\times\) KV cache reduction → can fit 128K context with 21B active params
Trade-off: Slight quality loss, more complex implementation
Q7: When would you choose T5/BART over a decoder-only model?
A:
Structured tasks: Translation, summarization where input/output are distinct
Long inputs, short outputs: Encoder compresses entire input bidirectionally
Legacy systems: Already deployed with T5/BART
Modern alternative: Decoder-only works well via prompting, simpler to deploy
Q8: How much data is needed to train from scratch?
A (Ballpark Estimates):
ResNet-50 (25M params): 1M-10M images (ImageNet has 1.3M). Transfer learning common with fewer.
BERT-Base (110M params): 10-100GB text (Wikipedia 16GB + BookCorpus 4GB). Original used BooksCorpus + Wiki.
RoBERTa-Base (125M params): 100-160GB text (160GB used in paper). More data than BERT → better performance.
GPT-3 (175B params): 300B tokens (Common Crawl filtered, WebText, Books). 570GB compressed.
LLaMA-7B (7B params): 1T tokens (1.4TB text). LLaMA-2 used 2T tokens.
LLaMA-70B (70B params): 1T-2T tokens (same data, longer training).
General rule: Chinchilla scaling laws suggest 20 tokens per parameter (70B model → 1.4T tokens).
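The Chinchilla rule is a one-liner worth internalizing:

```python
# Chinchilla rule of thumb: compute-optimal tokens ~ 20 * parameters.

def chinchilla_tokens(params_b):
    return 20 * params_b  # billions of tokens

print(chinchilla_tokens(70))   # 1400 -> 1.4T tokens for a 70B model
print(chinchilla_tokens(7))    # 140  -> 0.14T; LLaMA-7B's 1T is far past this
```

Note that "compute-optimal" is not "inference-optimal": LLaMA deliberately overtrains small models well past the Chinchilla point to get cheaper-to-serve models.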
Practical Takeaways:
Vision models: 1K-10K images per class minimum; transfer learning recommended for \(<\)100K images
Small LMs (BERT-size): 10GB-100GB text corpus (can scrape domain-specific data)
Large LMs (\(>\)7B): Requires web-scale data (100GB-1TB+); most orgs fine-tune pretrained models
Data quality \(>\) quantity: LLaMA outperforms larger models trained on noisier data