Chapter 11: Key Architectures (ResNet, BERT, GPT, Qwen)
13 Overview
This chapter covers production architectures in vision and NLP: ResNet, BERT, GPT, LLaMA, and modern variants. We examine why design choices matter: not just the specs, but the reasoning behind them, distilled from billions of GPU-hours of experiments.
Key questions: Why bottleneck blocks in ResNet? Why pre-norm in GPT-2? Why RMSNorm in LLaMA? We trace evolution from GPT-2 (1.5B) to LLaMA-3 (405B), BERT’s 512 tokens to Gemini’s 10M, and production trade-offs: 7B vs 70B, decoder-only vs encoder-decoder, MoE vs dense.
Cross-Reference Guide:
RoPE mechanism: See attention.tex §3.2
GQA vs MHA vs MQA: See attention.tex §3.3
MLA (Multi-head Latent Attention): See attention.tex (DeepSeek section)
Sliding window attention: See attention.tex §4
MoE routing & load balancing: See attention.tex §8
Flash Attention: See attention.tex §4.3
Distributed training (ZeRO, FSDP): See training_optimization.tex
13.1 Architecture Taxonomy
Three paradigms:
Encoder-only (BERT, RoBERTa): Bidirectional attention → rich understanding, no generation. Best for classification, NER, embeddings.
Decoder-only (GPT, LLaMA, Mistral): Causal attention (\(i\) sees \(\leq i\)) → efficient KV caching, dominant 2023-2024. Why: unified pre-training, efficient generation, predictable scaling, instruction-tuning flexibility.
Encoder-decoder (T5, BART): Bidirectional encoder + autoregressive decoder. Natural for translation/summarization but more complex (dual stacks, no prefix caching).
14 Vision Architectures
14.1 ResNet (Deep Residual Networks)
Before ResNet, very deep networks suffered from a degradation problem: adding layers made both training and test error worse, an optimization failure not explained by overfitting alone (and aggravated by vanishing/exploding gradients). ResNet (He et al., 2015) solved this with residual connections.
The Key Insight: Instead of learning the desired output \(H(x)\) directly, learn the residual \(F(x) = H(x) - x\) and add it to the input: \[y = x + F(x), \quad \text{where } F(x) \text{ is learned by the layer's weights}\]
Important: The network directly learns \(F(x)\); we never compute \(H(x)\) explicitly. The notation \(F(x) = H(x) - x\) just expresses the relationship: if the desired output is \(H(x)\), the residual branch must produce \(F(x) = H(x) - x\).
Why this helps: If the optimal mapping is close to identity (\(H(x) \approx x\)), it is easier to learn \(F(x) \approx 0\) (push weights toward zero) than to learn \(H(x) = x\) from scratch (which requires precise weight tuning to reproduce the input). Skip connections also create direct gradient paths: gradients flow backward through the identity branch without attenuation.
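The identity-shortcut argument can be checked numerically. The toy block below (pure Python, illustrative shapes and a two-layer MLP as the residual branch, not the actual conv block) shows that all-zero residual weights give exactly the identity mapping, which is why "learn \(F(x) \approx 0\)" is the easy case:

```python
# Toy residual block: y = x + F(x), with F a tiny two-layer MLP.
# Shapes and weights are illustrative, not from the ResNet paper.

def linear(W, x):
    # W is a list of rows; returns W @ x
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def residual_block(x, W1, W2):
    h = [max(0.0, v) for v in linear(W1, x)]       # ReLU(W1 x)
    f = linear(W2, h)                              # F(x) = W2 ReLU(W1 x)
    return [x_i + f_i for x_i, f_i in zip(x, f)]   # y = x + F(x)

x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
# With all-zero weights, F(x) = 0 and the block is exactly the identity:
print(residual_block(x, zeros, zeros))  # [1.0, -2.0, 3.0]
```

A plain (non-residual) block with zero weights would instead output all zeros, destroying the signal; the shortcut makes "do nothing" the default behavior.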
Block types: Basic blocks (ResNet-18/34): two \(3 \times 3\) convs. Bottleneck blocks (ResNet-50/101/152): \(1\times1 \rightarrow 3\times3 \rightarrow 1\times1\) (compress → process → expand), using far fewer FLOPs than the same depth of full-width \(3\times3\) convs.
Bottleneck Block (ResNet-50/101/152):
Key Details:
Compress: \(1\times1\) conv reduces channels (256 → 64) to save computation
Process: \(3\times3\) conv operates on the reduced dimension (at 1/4 the channel width, this conv costs roughly \(16\times\) less, since conv cost scales with \(C_{in} \cdot C_{out}\))
Expand: \(1\times1\) conv restores channels (64 → 256) to match skip connection
BatchNorm + ReLU: After each conv (except the last; that ReLU comes after the addition)
Skip path: Identity when dimensions match; \(1\times1\) projection when changing channels/spatial size
Final ReLU: Applied after adding residual, not before
ResNet-50: Stem (\(7\times7\) conv/2 + maxpool) → 4 stages of bottlenecks (256 → 512 → 1024 → 2048 channels) → global avg pool → FC. ResNet variants (ResNeXt, EfficientNet) modify blocks but keep residual principle.
ResNet-50 macro-architecture:
14.2 MobileNet: Efficient CNNs for Mobile/Edge
MobileNet (Howard et al., 2017) achieves mobile-friendly efficiency via depthwise separable convolutions.
Standard Convolution:
Input: \(D_F \times D_F \times M\) (spatial \(D_F\), \(M\) input channels)
Kernel: \(D_K \times D_K \times M \times N\) (\(N\) output channels)
Cost: \(D_K \times D_K \times M \times N \times D_F \times D_F\) MACs (multiply-accumulates)
Depthwise Separable = Depthwise + Pointwise:
Depthwise conv: \(3\times3\) conv per channel (no mixing channels)
Kernel: \(D_K \times D_K \times 1\) per channel (total \(M\) kernels)
Cost: \(D_K \times D_K \times M \times D_F \times D_F\) MACs
Pointwise conv: \(1\times1\) conv to mix channels
Kernel: \(1 \times 1 \times M \times N\)
Cost: \(M \times N \times D_F \times D_F\) MACs
Reduction Factor: \[\frac{\text{Depthwise separable cost}}{\text{Standard conv cost}} = \frac{D_K^2 \cdot M \cdot D_F^2 + M \cdot N \cdot D_F^2}{D_K^2 \cdot M \cdot N \cdot D_F^2} = \frac{1}{N} + \frac{1}{D_K^2}\] For \(3\times3\) convs (\(D_K=3\)) and many channels (\(N \gg 1\)): reduction \(\approx \frac{1}{9} + \epsilon\) → **8-9× fewer MACs**.
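The reduction factor can be verified numerically. The sizes below (\(D_F=56\), \(M=64\), \(N=128\)) are illustrative, not from the MobileNet paper:

```python
# MAC counts for standard vs depthwise-separable convolution, using
# the cost formulas above. Example sizes are illustrative.

def standard_macs(DK, M, N, DF):
    return DK * DK * M * N * DF * DF

def separable_macs(DK, M, N, DF):
    depthwise = DK * DK * M * DF * DF   # one DKxDK filter per channel
    pointwise = M * N * DF * DF         # 1x1 conv mixes channels
    return depthwise + pointwise

DK, M, N, DF = 3, 64, 128, 56
ratio = separable_macs(DK, M, N, DF) / standard_macs(DK, M, N, DF)
print(round(ratio, 4))                  # 0.1189
print(round(1 / N + 1 / DK**2, 4))      # 0.1189 -- matches 1/N + 1/DK^2
```

The spatial size \(D_F\) cancels in the ratio, so the \(\approx 8.4\times\) saving here holds at any resolution.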
MobileNetV2/V3 Innovations:
Inverted residuals (V2): Expand channels in bottleneck (\(64 \to 384 \to 64\)) instead of compress
Linear bottleneck: Remove ReLU before projection (preserve information in low-dim space)
Squeeze-and-Excitation (V3): Channel attention mechanism (lightweight)
h-swish activation (V3): Hardware-friendly approximation of Swish
14.3 EfficientNet: Compound Scaling
EfficientNet (Tan & Le, 2019) optimizes depth, width, and resolution jointly via compound scaling.
Traditional Scaling (suboptimal):
Depth scaling: More layers (ResNet-50 → ResNet-152)
Width scaling: More channels per layer (ResNet-50 → WideResNet)
Resolution scaling: Larger input images (224×224 → 299×299)
Scaling one dimension hits diminishing returns; EfficientNet scales all three.
Compound Scaling Formula: \[\text{depth: } d = \alpha^\phi, \quad \text{width: } w = \beta^\phi, \quad \text{resolution: } r = \gamma^\phi\] subject to \(\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2\) and \(\alpha \geq 1, \beta \geq 1, \gamma \geq 1\).
Constraint ensures FLOPs grow as \(2^\phi\) (doubling compute per step). Grid search finds \(\alpha=1.2, \beta=1.1, \gamma=1.15\) (EfficientNet-B0 baseline).
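A quick check of the constraint with the published coefficients (FLOPs scale roughly as depth \(\times\) width\(^2\) \(\times\) resolution\(^2\)):

```python
# Compound scaling: verify the ~2x FLOPs-per-step constraint and
# compute the multipliers for phi = 3 (illustrative choice of phi).

alpha, beta, gamma = 1.2, 1.1, 1.15      # published EfficientNet coefficients

flops_growth_per_step = alpha * beta**2 * gamma**2
print(round(flops_growth_per_step, 3))   # 1.92, i.e. ~2x compute per phi step

phi = 3
depth_mult = alpha ** phi
width_mult = beta ** phi
res_mult = gamma ** phi
print(round(depth_mult, 3), round(width_mult, 3), round(res_mult, 3))
```

So each increment of \(\phi\) scales all three dimensions together while roughly doubling compute.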
MBConv Block (EfficientNet building block):
Inverted residual bottleneck (MobileNetV2-style)
Squeeze-and-Excitation (SE) attention
Stochastic depth (drop path regularization)
EfficientNet Family:
B0: 5.3M params, 0.39B FLOPs, baseline found via neural architecture search (NAS)
B1-B7: Scale B0 using compound scaling (\(\phi = 1, 2, \ldots, 7\))
B7: 66M params, 37B FLOPs, 84.3% ImageNet top-1 (SOTA at publication)
EfficientNetV2: Faster training (Fused-MBConv blocks, adaptive regularization)
Why This Matters:
EfficientNet-B1 (7.8M params) matches ResNet-152 (60M params) accuracy with roughly 16× fewer FLOPs
Demonstrates importance of balanced scaling vs just "more layers"
Widely used for mobile/edge vision (object detection, segmentation) before ViTs
15 Encoder-Only Models
15.1 BERT (Bidirectional Encoder Representations from Transformers)
BERT (Devlin et al., 2019) brought bidirectional pre-training to transformers. Unlike GPT’s left-to-right modeling, BERT sees full sequence (past + future) when encoding each token → powerful for understanding, unsuitable for generation.
Architecture: \(N\) encoder blocks (self-attn + FFN), learned position embeddings, post-norm (LayerNorm after attn/FFN). BERT-Base: 12L, 768H, 12 heads, 110M params. BERT-Large: 24L, 1024H, 16 heads, 340M params.
Training: MLM (mask 15% tokens, predict from bidirectional context) + NSP (next sentence prediction). Data: BooksCorpus + Wikipedia (3.3B words), 512 tokens max.
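The MLM corruption can be sketched concretely. This follows the BERT paper's 80/10/10 rule for selected positions (80% → [MASK], 10% → random token, 10% → unchanged); the tokens, vocabulary, and function name are illustrative:

```python
# Sketch of BERT-style MLM corruption: select ~15% of positions; of
# those, 80% -> [MASK], 10% -> random token, 10% -> left unchanged.
import random

def mlm_corrupt(tokens, vocab, mask_rate=0.15, rng=random.Random(0)):
    out, targets = list(tokens), {}
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            targets[i] = tokens[i]          # model must predict the original
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = rng.choice(vocab)  # random replacement
            # else: keep the token unchanged (the 10% identity case)
    return out, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"] * 20
corrupted, targets = mlm_corrupt(tokens, vocab=["dog", "ran", "hat"])
print(len(targets) / len(tokens))  # close to 0.15
```

The identity case is why the pre-train/inference mismatch mentioned below is only partial: the model sometimes must predict an unmasked token too.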
Limitations: 512-token limit too short, bidirectional prevents generation, MLM creates pre-train/fine-tune mismatch ([MASK] seen in training, not inference).
15.2 RoBERTa (Robustly Optimized BERT)
RoBERTa (Liu et al., 2019, Meta): same architecture as BERT-Large, but better training. Remove NSP (harmful), dynamic masking (different masks each epoch), larger batches (8K vs 256), more data (160GB vs 16GB), longer training (500K steps). Result: SOTA on GLUE/SQuAD/RACE without architectural changes.
Lesson: How you train often matters more than architecture. LLaMA-3's gains over LLaMA-2 come largely from training on 7.5× more tokens (15T vs 2T), not architecture.
Interview Insight: BERT is rarely used for generation (no causal masking). Modern practice: Use decoder-only models (LLaMA, Mistral) even for classification via instruction-tuning. BERT still relevant for embeddings (sentence-transformers) and low-latency classification where you need bidirectional understanding without generation.
16 Decoder-Only Models
16.1 GPT (Generative Pre-trained Transformer)
GPT established the decoder-only paradigm dominating modern LLMs. Unlike BERT's bidirectional encoding, GPT uses causal attention: token \(i\) only sees positions \(0\) through \(i\) (itself and everything before it). This enables autoregressive generation but sacrifices bidirectional context.
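The causal constraint is just a lower-triangular mask over attention scores. A minimal sketch (1 = may attend, 0 = blocked):

```python
# Causal attention mask: position i may attend only to positions <= i.

def causal_mask(n):
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

In practice the blocked positions get \(-\infty\) added to their logits before the softmax; the triangular structure is also what makes KV caching valid, since past tokens never attend to future ones.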
GPT-1 (2018): 12 layers, 768 dim, 117M params on BooksCorpus. Key insight: unsupervised pre-training transfers to supervised tasks.
GPT-2 (2019): 1.5B parameters. Crucial change: pre-normalization (LayerNorm before attention/FFN) stabilizes deep network training via cleaner gradient flow. Demonstrated zero-shot learning by framing tasks as text completion.
Decoder-only transformer stack:
GPT-3 (2020): 175B params (96L, 12,288H, 2048 ctx). Demonstrated in-context learning: few-shot prompting without gradient updates. Training: $4.6M on 10K V100s, 300B tokens. Architecture similar to GPT-2 but with sparse attention in some layers. Real innovation: scale produced emergent capabilities such as instruction following, coding, and arithmetic.
GPT-4 (2023): Estimated 1.7T params (MoE, 280B active/token). Multi-modal (text+images), 128K context. Extensive RLHF for alignment.
16.2 LLaMA Family (Meta)
Meta’s LLaMA democratized LLM research with competitive open-weights models on public data. Its architecture became the open-source standard.
LLaMA-1 (2023): 7B/13B/33B/65B sizes. 7B spec: 32L, 4096H, 32 heads (\(d_k=128\)), 2048 ctx (→4096 via RoPE).
Key innovations:
RoPE: \(q_m = R_m q, k_n = R_n k\) → attention depends on \((m-n)\), enables length extrapolation
SwiGLU: \(\text{SwiGLU}(x) = \text{Swish}(xW_1) \odot (xW_2)\) → better quality
RMSNorm: \(\frac{x}{\sqrt{\text{mean}(x^2) + \epsilon}} \cdot \gamma\) → simpler than LayerNorm
No bias terms → better training stability at scale
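RoPE's defining property, that attention scores depend only on the relative offset \(m-n\), can be demonstrated directly. A pure-Python sketch with illustrative 4-dim vectors (real models rotate 128-dim heads):

```python
# RoPE sketch: rotate 2-D pairs of q and k by position-dependent angles;
# the q.k dot product then depends only on the offset m - n.
import math

def rope(vec, pos, base=10000.0):
    out = []
    d = len(vec)
    for i in range(0, d, 2):
        theta = base ** (-i / d)              # per-pair frequency
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]  # 2-D rotation
    return out

def score(q, k, m, n):
    qm, kn = rope(q, m), rope(k, n)
    return sum(a * b for a, b in zip(qm, kn))

q, k = [0.3, -1.2, 0.7, 0.5], [1.1, 0.4, -0.6, 0.9]
# Same offset (m - n = 2) at different absolute positions -> same score:
print(abs(score(q, k, 5, 3) - score(q, k, 40, 38)) < 1e-9)  # True
```

This is exactly the \(q_m = R_m q\), \(k_n = R_n k\) formulation above: \(q_m^\top k_n = q^\top R_{n-m} k\).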
Training: 1.4T tokens (CommonCrawl, C4, GitHub, Wikipedia, arXiv, StackExchange). 65B took 21 days on 2048 A100s.
LLaMA-2 (2023): 2T tokens, 4K context, GQA in the 70B model (groups of query heads share KV projections; 64 Q heads with 8 KV heads gives an 8× KV-cache reduction). Added instruction-tuning + RLHF, competitive with GPT-3.5.
LLaMA-3 (2024): 15T tokens (7.5× more), 128K vocab (from 32K), 8K ctx → 128K via fine-tuning. All sizes use GQA (8 KV heads). 405B flagship: 126L, 16,384H, 128 Q heads, 8 KV heads. Trained on 16K H100s. Multi-stage RLHF + rejection sampling + DPO.
LLaMA Architecture Summary (7B/8B):
| Component | LLaMA-1 (7B) | LLaMA-3 (8B) |
|---|---|---|
| Layers | 32 | 32 |
| Hidden dim | 4096 | 4096 |
| Heads | 32 | 32 |
| KV heads | 32 (MHA) | 8 (GQA) |
| Context | 2048 | 8192 |
| Vocab | 32K | 128K |
| Tokens trained | 1T | 15T |
| FFN dim | 11,008 | 14,336 |
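The MHA → GQA switch in the table above translates directly into KV-cache savings. A quick calculation, using the table's numbers (fp16, 8K sequence; the sequence length is an illustrative choice):

```python
# KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * seq * bytes.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

head_dim = 4096 // 32                                   # 128
mha = kv_cache_bytes(32, 32, head_dim, seq_len=8192)    # LLaMA-1-style MHA
gqa = kv_cache_bytes(32, 8, head_dim, seq_len=8192)     # LLaMA-3 GQA
print(mha / 2**30, gqa / 2**30, mha // gqa)             # 4.0 GiB, 1.0 GiB, 4x
```

At long contexts this cache, not the weights, is often the binding memory constraint, which is why every recent LLaMA generation reduces KV heads.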
16.3 Mistral & Mixtral (Mistral AI)
Mistral 7B (Jiang et al., 2023):
Architecture:
32 layers, 4096 hidden, 32 heads, 7.3B params
Sliding Window Attention (SWA): Each layer only attends to previous 4096 tokens (window size)
GQA: 8 KV heads (4\(\times\) compression)
Context: 8K tokens (via SWA), effective receptive field grows with layers
RoPE, SwiGLU, RMSNorm (same as LLaMA)
Key Innovation: Sliding window allows longer context with \(O(w)\) memory per layer (vs \(O(n)\) for full attention), where \(w=4096\)
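Sliding-window attention is a banded version of the causal mask. A sketch with a tiny window (\(w=3\) here for readability; Mistral uses \(w=4096\)):

```python
# Sliding-window mask: position i attends to positions in (i - w, i].
# Stacking L such layers grows the effective receptive field to ~L * w.

def sliding_window_mask(n, w):
    return [[1 if i - w < j <= i else 0 for j in range(n)] for i in range(n)]

mask = sliding_window_mask(6, 3)
print(mask[5])       # [0, 0, 0, 1, 1, 1] -- only the last w positions visible
print(sum(mask[5]))  # 3 -- KV cache per layer holds at most w entries
```

Because each layer sees its predecessor's window, information from tokens outside the window still propagates upward through the stack, which is how the effective receptive field exceeds \(w\).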
Mixtral 8x7B (Jiang et al., 2024):
Architecture:
Sparse Mixture-of-Experts (MoE)
8 experts per layer (each expert is 7B FFN)
Router selects top-2 experts per token
Total: 46.7B params, only 12.9B active per token
32 layers, 4096 hidden, same attention as Mistral 7B
Context: 32K tokens
MoE Benefits:
Inference cost of 12.9B model, performance of 46.7B model
Outperforms LLaMA-2 70B on most benchmarks
Latency comparable to a \(\sim\)13B dense model (the two selected experts run in parallel)
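The top-2 routing step can be sketched in a few lines. The gate logits and the scalar "experts" below are illustrative stand-ins (real experts are 7B-parameter FFNs, and the gate is a learned linear layer):

```python
# Top-2 MoE routing sketch: softmax over 8 expert logits, keep the two
# largest, renormalize, and mix only those experts' outputs.
import math

def top2_route(logits):
    idx = sorted(range(len(logits)), key=lambda i: logits[i])[-2:]
    exps = {i: math.exp(logits[i]) for i in idx}
    z = sum(exps.values())
    return {i: exps[i] / z for i in idx}   # weights over the 2 chosen experts

# Scalar functions stand in for the 8 expert FFNs:
experts = [lambda x, k=k: (k + 1) * x for k in range(8)]

gate_logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
weights = top2_route(gate_logits)
y = sum(w * experts[i](1.0) for i, w in weights.items())
print(sorted(weights))   # [1, 4] -> only experts 1 and 4 are activated
```

Only 2 of 8 expert FFNs execute per token, which is the whole source of the total-vs-active parameter gap.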
16.4 Qwen (Alibaba Cloud)
Alibaba’s Qwen targets multilingual (esp. Chinese-English-code). Architecture follows LLaMA (RoPE, SwiGLU, RMSNorm, pre-norm) but with massive 151,851-token vocabulary (5× LLaMA’s 32K) to handle Chinese characters + English + code efficiently.
Qwen-1 (2023): 7B uses 32L, 4096H, 32 heads (MHA). 3T tokens with aggressive deduplication/filtering. Multi-stage: pre-train → SFT → RLHF.
Qwen-2 (2024): 29 languages, improved code (more GitHub data), 128K ctx via RoPE scaling.
Qwen-2.5 (2024): 18T tokens (6× more), 0.5B-72B sizes. Specialized: Qwen-2.5-Coder, Qwen-2.5-Math.
16.5 DeepSeek (DeepSeek AI)
DeepSeek-V2 (2024) solves long-context memory via Multi-head Latent Attention (MLA). Problem: at 128K context the KV cache dominates memory. Solution: cache a low-rank latent and reconstruct keys/values from it: \[c_{KV} = W^{\text{down}} x \in \mathbb{R}^{d_c}, \qquad K = W_K^{\text{up}} c_{KV}, \qquad V = W_V^{\text{up}} c_{KV}, \qquad d_c \ll d\] where \(d_c = 512\) versus hidden size \(d = 5120\) → roughly 10× KV cache reduction (only \(c_{KV}\) is stored; K and V are reconstructed at attention time).
Architecture: 236B total, 21B active/token (MoE: 2 shared + 160 routed experts, top-6 routing). 60L, 5120H, 128K ctx. Training: 8.1T tokens (Chinese + English + code).
Impact: MLA makes long-context practical. LLaMA-70B (8K ctx) needs massive clusters; DeepSeek-V2 (128K ctx, 21B active) runs on modest hardware. Trade-off: complexity + slight quality loss from low-rank projection.
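The down/up projection pattern is simple to sketch. Dimensions here are shrunk for illustration (\(d=8\), \(d_c=2\); DeepSeek-V2 uses roughly 5120 and 512), and the weights are random stand-ins:

```python
# MLA sketch: cache a small latent c = W_down x instead of full K/V,
# and reconstruct K = W_up c at attention time.
import random

rng = random.Random(0)
d, d_c = 8, 2
W_down = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d_c)]  # d_c x d
W_up = [[rng.gauss(0, 1) for _ in range(d_c)] for _ in range(d)]    # d x d_c

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

x = [rng.gauss(0, 1) for _ in range(d)]
c = matvec(W_down, x)   # cached: d_c floats per token (4x smaller here)
k = matvec(W_up, c)     # reconstructed key, back in d dims
print(len(c), len(k))   # 2 8
```

The cache shrinks by \(d / d_c\); the price is that reconstructed K/V are constrained to a rank-\(d_c\) subspace, which is the "slight quality loss" trade-off noted above.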
16.6 Phi Models (Microsoft)
Phi-1 (2023): 1.3B params, trained on high-quality code/reasoning data
Phi-2 (2023): 2.7B params, outperforms 7B models on reasoning benchmarks
Phi-3 (2024):
Sizes: 3.8B (mini), 7B, 14B
Context: 128K tokens
Key idea: Small models with carefully curated data (3.3T tokens)
Architecture: Standard decoder-only (similar to LLaMA)
Philosophy: Quality over quantity; smaller models trained on synthetic reasoning data can match larger models
17 Encoder-Decoder Models
17.1 T5 (Text-to-Text Transfer Transformer)
Paper: Raffel et al., 2020 (Google)
Architecture:
Encoder-decoder (original Transformer from Vaswani et al.)
Sizes: 60M, 220M, 770M, 3B, 11B params
11B spec: 24 encoder layers, 24 decoder layers, 1024 hidden, 16 heads
Relative position embeddings (not absolute)
SentencePiece tokenization (32K vocab)
Key Innovation: Unified text-to-text framework, with all tasks framed as "input text → output text"
Translation: "translate English to German: That is good." → "Das ist gut."
Classification: "sentiment: This movie is great" → "positive"
Summarization: "summarize: [long text]" → "[summary]"
Training:
C4 dataset (Colossal Clean Crawled Corpus): 750GB text
Pre-training objective: Span corruption (mask consecutive spans, predict them)
Multi-task fine-tuning on supervised tasks
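Span corruption can be sketched concretely. Span selection is hard-coded below for determinism (T5 samples spans covering ~15% of tokens); the sentinel names follow T5's `<extra_id_N>` convention:

```python
# T5-style span corruption: replace chosen spans with sentinel tokens
# in the input; the target lists each sentinel followed by its span.

def span_corrupt(tokens, spans):
    inp, tgt = [], []
    prev_end = 0
    for s, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{s}>"
        inp += tokens[prev_end:start] + [sentinel]
        tgt += [sentinel] + tokens[start:end]
        prev_end = end
    inp += tokens[prev_end:]
    return inp, tgt

tokens = "thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, spans=[(1, 2), (6, 8)])
print(inp)  # ['thank', '<extra_id_0>', 'for', 'inviting', 'me', 'to',
            #  '<extra_id_1>', 'last', 'week']
print(tgt)  # ['<extra_id_0>', 'you', '<extra_id_1>', 'your', 'party']
```

Unlike BERT's per-token masking, whole spans collapse to a single sentinel, so the decoder must generate variable-length completions.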
Use Cases:
Translation, summarization (better than decoder-only for these)
Question answering
Text classification (via text-to-text)
17.2 BART (Bidirectional and Auto-Regressive Transformers)
Paper: Lewis et al., 2020 (Facebook/Meta)
Architecture:
Encoder-decoder (similar to T5)
BART-Large: 12 encoder layers, 12 decoder layers, 1024 hidden, 406M params
Standard Transformer architecture (absolute position embeddings, GELU)
Pre-training: Denoising autoencoder with various corruption strategies:
Token masking (like BERT)
Token deletion
Text infilling (replace spans with single mask token)
Sentence permutation (shuffle sentences)
Document rotation (rotate document to start from random token)
Differences from T5:
BART uses ReLU/GELU (T5 uses gated activations)
BART has more diverse corruption (T5 uses span corruption only)
BART trained on smaller data (160GB vs 750GB)
Use Cases:
Summarization (CNN/DailyMail state-of-the-art)
Text generation with strong understanding (encoder helps)
Translation (fine-tuned)
Interview Insight: Encoder-decoder models (T5, BART) excel at sequence-to-sequence tasks where input and output are different (translation, summarization). Decoder-only models (GPT, LLaMA) dominate for generation tasks where output continues/responds to input. Modern trend: Even summarization/translation increasingly done with decoder-only via prompting.
18 Architecture Comparison Tables
18.1 Key Architectural Choices
| Model | Pos Enc | Norm | Activation | Attn | Norm Pos |
|---|---|---|---|---|---|
| BERT | Learned | LayerNorm | GELU | Full | Post |
| GPT-2 | Learned | LayerNorm | GELU | Causal | Pre |
| GPT-3 | Learned | LayerNorm | GELU | Causal | Pre |
| T5 | Relative | LayerNorm | ReLU | Full | Pre |
| LLaMA | RoPE | RMSNorm | SwiGLU | Causal | Pre |
| Mistral | RoPE | RMSNorm | SwiGLU | Sliding | Pre |
| Qwen | RoPE | RMSNorm | SwiGLU | Causal | Pre |
| DeepSeek-V2 | RoPE | RMSNorm | SwiGLU | MLA | Pre |
18.2 Model Specifications
| Model | Params | Layers | Hidden | Heads | Context |
|---|---|---|---|---|---|
| BERT-Base | 110M | 12 | 768 | 12 | 512 |
| BERT-Large | 340M | 24 | 1024 | 16 | 512 |
| GPT-2 | 1.5B | 48 | 1600 | 25 | 1024 |
| GPT-3 | 175B | 96 | 12,288 | 96 | 2048 |
| T5-11B | 11B | 24/24 | 1024 | 128 | 512 |
| LLaMA-7B | 7B | 32 | 4096 | 32 | 2048 |
| LLaMA-2-70B | 70B | 80 | 8192 | 64 | 4096 |
| LLaMA-3-8B | 8B | 32 | 4096 | 32 | 8192 |
| LLaMA-3-70B | 70B | 80 | 8192 | 64 | 8192 |
| LLaMA-3-405B | 405B | 126 | 16,384 | 128 | 8192 |
| Mistral-7B | 7.3B | 32 | 4096 | 32 | 8192 |
| Mixtral-8x7B | 46.7B | 32 | 4096 | 32 | 32K |
| Qwen-7B | 7B | 32 | 4096 | 32 | 8192 |
| Qwen-72B | 72B | 80 | 8192 | 64 | 32K |
| DeepSeek-V2 | 236B | 60 | 5120 | 128 | 128K |
| Phi-3-mini | 3.8B | 32 | 3072 | 32 | 128K |
18.3 Training Data Comparison
| Model | Tokens | Key Datasets |
|---|---|---|
| BERT | 3.3B words | BooksCorpus + Wikipedia |
| GPT-2 | 10B tokens | WebText (Reddit links) |
| GPT-3 | 300B tokens | CommonCrawl + Books + Wikipedia |
| LLaMA-1 | 1.4T tokens | CommonCrawl + C4 + GitHub + arXiv |
| LLaMA-2 | 2T tokens | Higher quality CC + code |
| LLaMA-3 | 15T tokens | Curated web + multilingual |
| Qwen-2.5 | 18T tokens | Multilingual + code + math |
| DeepSeek-V2 | 8.1T tokens | Chinese + English + code |
19 When to Use Which Architecture
19.1 Decision Framework
Classification / Embeddings / NER: BERT-like (RoBERTa, DeBERTa) for bidirectional understanding. Modern alternative: instruction-tuned decoder models (LLaMA, Mistral) for flexibility at cost of speed.
Text generation / Chat: Decoder-only (LLaMA-3, Qwen, Mistral). Long context: DeepSeek-V2 (MLA), LLaMA-3 (128K). Efficiency: Mixtral-8x7B (13B active/47B total), Phi-3 (3.8B, strong reasoning).
Translation / Summarization: Traditional: T5/BART (encoder-decoder for distinct input/output). Modern: decoder-only via prompting (simpler). Encoder-decoder still better for very long documents (bidirectional compression).
Code: Specialized models (DeepSeek-Coder, Qwen-2.5-Coder, StarCoder, CodeLlama) excel at syntax/multi-file context. General models (LLaMA-3, Qwen) work reasonably.
Reasoning / Math: Scale helps: GPT-4, LLaMA-3-405B, DeepSeek-V2. Small: Phi-3 (synthetic reasoning data). Specialized: Qwen-2.5-Math.
Production considerations:
Latency requirements: Real-time applications need 7B-13B models; batch processing can use 70B+
Cost optimization: MoE models (Mixtral) activate fewer parameters; 4-bit quantization halves memory
Long context (100K+ tokens): DeepSeek-V2 with MLA compression or streaming attention architectures
Multilingual needs: Qwen (151K vocab) and LLaMA-3 (128K vocab) trained on diverse languages
On-premise deployment: Open-weight models (LLaMA, Mistral, Qwen) vs API-only (GPT-4, Claude)
19.2 Computational Budget Considerations
| Budget | Inference | Fine-Tuning |
|---|---|---|
| Single GPU (24GB) | Phi-3-mini (3.8B) | LoRA on 7B models |
| 2-4 GPUs (A100) | LLaMA-3-8B, Mistral-7B | Full fine-tune 7B |
| 8 GPUs (A100) | LLaMA-3-70B, Mixtral-8x7B | LoRA on 70B |
| 16+ GPUs | LLaMA-3-405B, DeepSeek-V2 | Full fine-tune 70B |
| API only | GPT-4, Claude-3.5 | Few-shot prompting |
Interview Tip: For production systems, consider:
Latency: Smaller models (7B-13B) for real-time applications
Cost: MoE models (Mixtral) or quantization (4-bit) for efficiency
Long context: DeepSeek-V2 (MLA) or streaming attention for 100K+ tokens
Multilingual: Qwen, LLaMA-3 (extensive multilingual training)
19.3 Model Pros & Cons Summary
| Model | Pros | Cons |
|---|---|---|
| BERT | Best for classification/NER; bidirectional | No generation; 512 token limit |
| GPT-3 | Powerful; in-context learning | Closed; expensive API; 2048 context |
| T5/BART | Strong at seq2seq tasks | Slower than decoder-only; encoder+decoder complexity |
| LLaMA-2-7B | Open weights; efficient; good quality | 4K context; weaker than GPT-3.5 |
| LLaMA-3-8B | 8K context; 15T tokens; strong | Requires more GPU memory than LLaMA-2 |
| LLaMA-3-70B | Near GPT-4 quality; open | 140GB memory (fp16); slow inference |
| LLaMA-3-405B | SOTA open model | 810GB memory; requires many GPUs |
| Mistral-7B | Sliding window; 8K context; fast | Smaller training data than LLaMA |
| Mixtral-8x7B | 47B params, 13B active; strong | 94GB memory (all experts); complex deployment |
| Qwen-7B | Excellent Chinese; strong code | Less English training than LLaMA |
| Qwen-72B | Multilingual; 32K context | 144GB memory; fewer users than LLaMA |
| DeepSeek-V2 | MLA = 10\(\times\) KV reduction; 128K | 236B total params; complex architecture |
| Phi-3-mini | 3.8B but strong reasoning | Narrow training data; less general |
20 Practical Considerations
20.1 Context Length Handling
| Technique | Models | Max Context |
|---|---|---|
| Sliding Window | Mistral | Effective \(\infty\) (4K window) |
| RoPE Scaling | LLaMA-2/3 | 128K (from 4K training) |
| MLA | DeepSeek-V2 | 128K (low KV cache) |
| Sparse Attention | Longformer, BigBird | 16K-32K |
| Chunking + RAG | Any model | Arbitrary (retrieve relevant) |
20.2 Quantization Options
16-bit (fp16/bf16): Standard training/inference, no quality loss
8-bit (int8): 2\(\times\) memory reduction, minimal quality loss (LLM.int8())
4-bit (NF4): 4\(\times\) reduction, slight quality loss (QLoRA, GPTQ, AWQ)
3-bit or lower: Noticeable degradation, only for extreme resource constraints
Rule of thumb:
7B model: 14GB (fp16), 7GB (8-bit), 3.5GB (4-bit)
70B model: 140GB (fp16), 70GB (8-bit), 35GB (4-bit)
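The rule of thumb is just params × bytes-per-param. A sketch (weights only; KV cache and activations add more in practice):

```python
# Weight-memory estimate in GB: params(B) * bits / 8.

def weight_gb(params_b, bits):
    return params_b * 1e9 * bits / 8 / 1e9

for params in (7, 70):
    print(params, weight_gb(params, 16), weight_gb(params, 8), weight_gb(params, 4))
# 7  -> 14.0 (fp16), 7.0 (int8), 3.5 (4-bit)
# 70 -> 140.0 (fp16), 70.0 (int8), 35.0 (4-bit)
```

Quantization metadata (scales, zero-points) adds a few percent on top of these numbers in real deployments.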
20.3 Fine-Tuning Strategies
Full fine-tuning: Update all parameters – best quality, high memory/compute
LoRA (Low-Rank Adaptation): Add trainable low-rank matrices \(\Delta W = AB\) where \(A \in \mathbb{R}^{d \times r}\), \(B \in \mathbb{R}^{r \times d}\), \(r \ll d\)
Memory: Only store \(2dr\) params (vs \(d^2\) for full)
Quality: 90-95% of full fine-tuning performance
Common rank: \(r=8\) to \(r=64\)
QLoRA: LoRA + 4-bit quantization – fine-tune 70B on single 48GB GPU
Prefix tuning: Add trainable prompt embeddings (100-1000 tokens)
Prompt tuning: Train soft prompts only (even fewer params than prefix)
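The LoRA memory claim above is simple arithmetic. For a 4096-wide square projection at rank \(r=8\) (dims matching a LLaMA-7B attention projection; the choice of layer is illustrative):

```python
# LoRA parameter count: a rank-r update W + A @ B trains 2*d*r params
# (A is d x r, B is r x d) instead of the full d*d matrix.

d, r = 4096, 8
full = d * d
lora = 2 * d * r
print(full, lora, round(100 * lora / full, 2))  # 16777216 65536 0.39
```

Training ~0.4% of the weights per adapted matrix is why optimizer state (the dominant fine-tuning memory cost) nearly vanishes, and why QLoRA can pair this with 4-bit frozen base weights.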
21 Key Innovations Timeline
2017: Transformer (Vaswani et al.) – attention is all you need
2018: BERT (bidirectional pre-training), GPT-1 (generative pre-training)
2019: GPT-2 (1.5B, zero-shot learning), RoBERTa (better BERT training), T5 (text-to-text)
2020: GPT-3 (175B, in-context learning), BART (denoising encoder-decoder)
2021: Codex (code generation), Switch Transformer (1.6T params MoE)
2022: ChatGPT (GPT-3.5 + RLHF), InstructGPT (alignment via human feedback)
2023: GPT-4 (multi-modal), LLaMA-1/2 (open weights), Mistral (sliding window), Claude-2 (100K context)
2024: LLaMA-3 (405B, 15T tokens), Qwen-2.5 (18T tokens), DeepSeek-V2 (MLA), Mixtral-8x22B, GPT-4o (omni-modal)
22 Future Directions
22.1 Emerging Trends (2024-2025)
Mixture-of-Experts scaling: Sparse models with 1T+ params, 100B active
Long context (1M+ tokens): Gemini 1.5 (10M tokens), improved KV compression
Multi-modal fusion: Vision + language tightly integrated (GPT-4o, Gemini 1.5)
Test-time compute: Models that "think longer" for harder problems (o1, o3)
Efficient architectures: State-space models (Mamba), linear attention variants
Post-training innovation: Better RLHF, DPO, synthetic data for reasoning
22.2 Open Research Questions
How to train 10T+ parameter models efficiently?
Can we get GPT-4 quality with 10B params via better data/algorithms?
Optimal MoE routing strategies (current routers are simple)
How to handle truly long context (10M+ tokens) efficiently?
Better alternatives to transformer attention for long sequences?
23 Interview Questions
Q1: Why did decoder-only models replace encoder-decoder for most tasks?
A:
Unified architecture: Same model for pre-training and all downstream tasks
KV caching: Efficient autoregressive generation (reuse past keys/values)
Scaling laws: Decoder-only performance improves more predictably with scale
Instruction-tuning: Can handle any task via prompting (no separate heads needed)
Q2: What’s the difference between Multi-Head Attention (MHA) and Grouped Query Attention (GQA)?
A:
MHA: Each head has separate K, V projections → \(h\) sets of KV cache
GQA: Groups of query heads share KV projections → fewer KV heads (e.g., 32 Q heads, 8 KV heads)
Benefit: Reduces KV cache by 4\(\times\) (critical for long context), minimal quality loss
Examples: LLaMA-3 (GQA with 8 KV heads), Mistral (GQA)
Q3: Why is RoPE (Rotary Position Embedding) better than learned positional embeddings?
A:
Relative positions: Attention score depends on \((m-n)\) naturally, not absolute positions
Extrapolation: Can handle sequences longer than training length via position interpolation or NTK-aware scaling of the rotation frequencies
No learned params: Deterministic rotation based on position and frequency
Used by: All modern LLMs (LLaMA, Qwen, Mistral, DeepSeek)
Q4: How does Mixture-of-Experts (MoE) improve efficiency?
A:
Sparse activation: Only activate top-K experts per token (e.g., Mixtral uses 2 of 8)
Total params \(\gg\) active params: 46.7B total, 12.9B active (Mixtral-8x7B)
Same latency: Experts run in parallel, no sequential overhead
Trade-off: Larger memory footprint (must load all experts), more complex training
Q5: What’s the difference between SwiGLU and GELU activations?
A:
GELU: Smooth approximation of ReLU, \(\text{GELU}(x) = x \cdot \Phi(x)\) (Gaussian CDF)
SwiGLU: Gated linear unit with Swish, \(\text{SwiGLU}(x) = \text{Swish}(xW_1) \odot (xW_2)\)
Why SwiGLU: Better performance empirically (PaLM paper), standard in modern LLMs (LLaMA, etc.)
Cost: SwiGLU has 2 projections (higher compute), but worth it for quality
Q6: Why does DeepSeek-V2 use Multi-head Latent Attention (MLA)?
A:
Problem: KV cache dominates memory for long contexts (e.g., 128K tokens)
Solution: Cache a low-rank latent \(c = W^{\text{down}} x\) with \(d_c \ll d\) (512 vs 5120), reconstructing \(K = W_K^{\text{up}} c\) and \(V = W_V^{\text{up}} c\) at attention time
Benefit: 10\(\times\) KV cache reduction → can fit 128K context with 21B active params
Trade-off: Slight quality loss, more complex implementation
Q7: When would you choose T5/BART over a decoder-only model?
A:
Structured tasks: Translation, summarization where input/output are distinct
Long inputs, short outputs: Encoder compresses entire input bidirectionally
Legacy systems: Already deployed with T5/BART
Modern alternative: Decoder-only works well via prompting, simpler to deploy
Q8: How much data is needed to train from scratch?
A (Ballpark Estimates):
ResNet-50 (25M params): 1M-10M images (ImageNet has 1.3M). Transfer learning common with fewer.
BERT-Base (110M params): 10-100GB text (Wikipedia 16GB + BookCorpus 4GB). Original used BooksCorpus + Wiki.
RoBERTa-Base (125M params): 100-160GB text (160GB used in paper). More data than BERT → better performance.
GPT-3 (175B params): 300B tokens (Common Crawl filtered, WebText, Books). 570GB compressed.
LLaMA-7B (7B params): 1T tokens (1.4TB text). LLaMA-2 used 2T tokens.
LLaMA-70B (70B params): 1T-2T tokens (same data, longer training).
General rule: Chinchilla scaling laws suggest 20 tokens per parameter (70B model → 1.4T tokens).
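The Chinchilla rule is a one-liner worth internalizing:

```python
# Chinchilla rule of thumb: compute-optimal tokens ~ 20 * parameters.

def chinchilla_tokens(params_b):
    return 20 * params_b  # billions of tokens

print(chinchilla_tokens(70))   # 1400 -> 1.4T tokens for a 70B model
print(chinchilla_tokens(7))    # 140  -> 0.14T; LLaMA-7B's 1T is far past this
```

Note that "compute-optimal" is not "inference-optimal": LLaMA deliberately overtrains small models well past the Chinchilla point to get cheaper-to-serve models.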
Practical Takeaways:
Vision models: 1K-10K images per class minimum; transfer learning recommended for \(<\)100K images
Small LMs (BERT-size): 10GB-100GB text corpus (can scrape domain-specific data)
Large LMs (\(>\)7B): Requires web-scale data (100GB-1TB+); most orgs fine-tune pretrained models
Data quality \(>\) quantity: LLaMA outperforms larger models trained on noisier data