12  Chapter 11: Key Architectures (ResNet, BERT, GPT, Qwen)

13 Overview

This chapter covers production architectures in vision and NLP: ResNet, BERT, GPT, LLaMA, and modern variants. We examine why design choices matter: not just the specs, but the reasoning distilled from billions of GPU-hours of experiments.

Key questions: Why bottleneck blocks in ResNet? Why pre-norm in GPT-2? Why RMSNorm in LLaMA? We trace the evolution from GPT-2 (1.5B) to LLaMA-3 (405B) and from BERT’s 512 tokens to Gemini’s 10M, plus production trade-offs: 7B vs 70B, decoder-only vs encoder-decoder, MoE vs dense.

Note

Cross-Reference Guide:

  • RoPE mechanism: See attention.tex §3.2

  • GQA vs MHA vs MQA: See attention.tex §3.3

  • MLA (Multi-head Latent Attention): See attention.tex (DeepSeek section)

  • Sliding window attention: See attention.tex §4

  • MoE routing & load balancing: See attention.tex §8

  • Flash Attention: See attention.tex §4.3

  • Distributed training (ZeRO, FSDP): See training_optimization.tex

13.1 Architecture Taxonomy

Three paradigms:

Encoder-only (BERT, RoBERTa): Bidirectional attention → rich understanding, no generation. Best for classification, NER, embeddings.

Decoder-only (GPT, LLaMA, Mistral): Causal attention (\(i\) sees \(\leq i\)) → efficient KV caching, dominant 2023-2024. Why: unified pre-training, efficient generation, predictable scaling, instruction-tuning flexibility.

Encoder-decoder (T5, BART): Bidirectional encoder + autoregressive decoder. Natural for translation/summarization but more complex (dual stacks, no prefix caching).

14 Vision Architectures

14.1 ResNet (Deep Residual Networks)

Before ResNet, deep networks suffered an optimization failure: adding layers made performance worse, even on training data (the degradation problem, compounded by vanishing/exploding gradients). ResNet (He et al., 2015) solved this with residual connections.

The Key Insight: Instead of learning the desired output \(H(x)\) directly, learn the residual \(F(x) = H(x) - x\) and add it to the input: \[y = x + F(x), \quad \text{where } F(x) \text{ is learned by the layer's weights}\]

Important: The network directly learns \(F(x)\); we never compute \(H(x)\) explicitly. The notation \(F(x) = H(x) - x\) just states the relationship: if we want output \(H(x)\), the residual branch must produce \(F(x) = H(x) - x\).

Why this helps: If the optimal mapping is close to identity (\(H(x) \approx x\)), it’s easier to learn \(F(x) \approx 0\) (push weights toward zero) than to learn \(H(x) = x\) from scratch (requires precise weight tuning to reproduce the input). Skip connections also create direct gradient paths: gradients flow backwards through the identity without attenuation.
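The identity-plus-residual computation can be sketched in a few lines of NumPy; toy dense layers stand in for the convolutions here, and the shapes are illustrative, not from any real network:

```python
import numpy as np

def residual_block(x, W1, W2):
    # y = x + F(x), where F(x) = W2 @ relu(W1 @ x) is what the layer learns
    return x + W2 @ np.maximum(W1 @ x, 0.0)

x = np.arange(4.0)
# With all weights at zero, F(x) = 0 and the block reduces to the identity,
# which is exactly why "learn F(x) ~ 0" is easy
y = residual_block(x, np.zeros((8, 4)), np.zeros((4, 8)))
```

The zero-weight case also illustrates the gradient path: the derivative of \(y\) with respect to \(x\) always contains the identity term, regardless of what \(F\) does.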

Block types: Basic blocks (ResNet-18/34): two \(3 \times 3\) convs. Bottleneck blocks (ResNet-50/101/152): \(1\times1 \rightarrow 3\times3 \rightarrow 1\times1\) (compress → process → expand). Bottleneck uses 70% fewer FLOPs than stacking \(3\times3\) convs.

Bottleneck Block (ResNet-50/101/152):

image

Key Details:

  • Compress: \(1\times1\) conv reduces channels (256 → 64) to save computation

  • Process: \(3\times3\) conv operates on reduced dimension (70% fewer FLOPs)

  • Expand: \(1\times1\) conv restores channels (64 → 256) to match skip connection

  • BatchNorm + ReLU: After each conv (except the last: ReLU comes after the addition)

  • Skip path: Identity when dimensions match; \(1\times1\) projection when changing channels/spatial size

  • Final ReLU: Applied after adding residual, not before

ResNet-50: Stem (\(7\times7\) conv/2 + maxpool) → 4 stages of bottlenecks (256 → 512 → 1024 → 2048 channels) → global avg pool → FC. ResNet variants (ResNeXt, EfficientNet) modify blocks but keep residual principle.

ResNet-50 macro-architecture:

image

14.2 MobileNet: Efficient CNNs for Mobile/Edge

MobileNet (Howard et al., 2017) achieves mobile-friendly efficiency via depthwise separable convolutions.

Standard Convolution:

  • Input: \(D_F \times D_F \times M\) (spatial \(D_F\), \(M\) input channels)

  • Kernel: \(D_K \times D_K \times M \times N\) (\(N\) output channels)

  • Cost: \(D_K \times D_K \times M \times N \times D_F \times D_F\) MACs (multiply-accumulates)

Depthwise Separable = Depthwise + Pointwise:

  1. Depthwise conv: \(3\times3\) conv per channel (no mixing channels)

    • Kernel: \(D_K \times D_K \times 1\) per channel (total \(M\) kernels)

    • Cost: \(D_K \times D_K \times M \times D_F \times D_F\) MACs

  2. Pointwise conv: \(1\times1\) conv to mix channels

    • Kernel: \(1 \times 1 \times M \times N\)

    • Cost: \(M \times N \times D_F \times D_F\) MACs

Reduction Factor: \[\frac{\text{Depthwise separable cost}}{\text{Standard conv cost}} = \frac{D_K^2 \cdot M \cdot D_F^2 + M \cdot N \cdot D_F^2}{D_K^2 \cdot M \cdot N \cdot D_F^2} = \frac{1}{N} + \frac{1}{D_K^2}\] For \(3\times3\) convs (\(D_K=3\)) and many channels (\(N \gg 1\)): reduction \(\approx \frac{1}{9} + \epsilon\) → 8-9× fewer MACs.
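The reduction factor can be checked numerically. The layer sizes below (a 112×112 feature map, 64 → 128 channels) are illustrative choices, not values from the paper:

```python
def conv_macs(d_f, d_k, m, n):
    # standard conv: D_K^2 * M * N multiply-accumulates per output position
    return d_k * d_k * m * n * d_f * d_f

def dws_macs(d_f, d_k, m, n):
    # depthwise stage (D_K^2 * M) plus pointwise stage (M * N)
    return d_k * d_k * m * d_f * d_f + m * n * d_f * d_f

# Example layer: 112x112 map, 3x3 kernel, 64 -> 128 channels
ratio = dws_macs(112, 3, 64, 128) / conv_macs(112, 3, 64, 128)
# ratio = 1/N + 1/D_K^2 = 1/128 + 1/9, i.e. roughly 8.4x fewer MACs
```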

MobileNetV2/V3 Innovations:

  • Inverted residuals (V2): Expand channels in bottleneck (\(64 \to 384 \to 64\)) instead of compress

  • Linear bottleneck: Remove ReLU before projection (preserve information in low-dim space)

  • Squeeze-and-Excitation (V3): Channel attention mechanism (lightweight)

  • h-swish activation (V3): Hardware-friendly approximation of Swish

14.3 EfficientNet: Compound Scaling

EfficientNet (Tan & Le, 2019) optimizes depth, width, and resolution jointly via compound scaling.

Traditional Scaling (suboptimal):

  • Depth scaling: More layers (ResNet-50 → ResNet-152)

  • Width scaling: More channels per layer (ResNet-50 → WideResNet)

  • Resolution scaling: Larger input images (224×224 → 299×299)

Scaling one dimension hits diminishing returns; EfficientNet scales all three.

Compound Scaling Formula: \[\text{depth: } d = \alpha^\phi, \quad \text{width: } w = \beta^\phi, \quad \text{resolution: } r = \gamma^\phi\] subject to \(\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2\) and \(\alpha \geq 1, \beta \geq 1, \gamma \geq 1\).

Constraint ensures FLOPs grow as \(2^\phi\) (doubling compute per step). Grid search finds \(\alpha=1.2, \beta=1.1, \gamma=1.15\) (EfficientNet-B0 baseline).
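A short sketch of the compound-scaling arithmetic, using the paper's \(\alpha, \beta, \gamma\) values, confirms that compute roughly doubles per step of \(\phi\):

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # EfficientNet-B0 grid-search values

def scaled(phi):
    # depth, width, and resolution multipliers at compound coefficient phi
    return alpha ** phi, beta ** phi, gamma ** phi

def flops_multiplier(phi):
    # FLOPs grow with depth * width^2 * resolution^2
    d, w, r = scaled(phi)
    return d * w ** 2 * r ** 2
```

Since \(\alpha \cdot \beta^2 \cdot \gamma^2 \approx 1.92\), each increment of \(\phi\) slightly undershoots a clean 2× in this sketch, matching the "\(\approx 2\)" constraint.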

MBConv Block (EfficientNet building block):

  • Inverted residual bottleneck (MobileNetV2-style)

  • Squeeze-and-Excitation (SE) attention

  • Stochastic depth (drop path regularization)

EfficientNet Family:

  • B0: 5.3M params, 0.39B FLOPs, baseline found via neural architecture search (NAS)

  • B1-B7: Scale B0 using compound scaling (\(\phi = 1, 2, \ldots, 7\))

  • B7: 66M params, 37B FLOPs, 84.3% ImageNet top-1 (SOTA at publication)

  • EfficientNetV2: Faster training (Fused-MBConv blocks, adaptive regularization)

Why This Matters:

  • EfficientNet-B4 (19M params) matches ResNet-152 (60M params) accuracy with 10× fewer FLOPs

  • Demonstrates importance of balanced scaling vs just "more layers"

  • Widely used for mobile/edge vision (object detection, segmentation) before ViTs

15 Encoder-Only Models

15.1 BERT (Bidirectional Encoder Representations from Transformers)

BERT (Devlin et al., 2019) brought bidirectional pre-training to transformers. Unlike GPT’s left-to-right modeling, BERT sees full sequence (past + future) when encoding each token → powerful for understanding, unsuitable for generation.

Architecture: \(N\) encoder blocks (self-attn + FFN), learned position embeddings, post-norm (LayerNorm after attn/FFN). BERT-Base: 12L, 768H, 12 heads, 110M params. BERT-Large: 24L, 1024H, 16 heads, 340M params.

Training: MLM (mask 15% tokens, predict from bidirectional context) + NSP (next sentence prediction). Data: BooksCorpus + Wikipedia (3.3B words), 512 tokens max.

Limitations: 512-token limit too short, bidirectional prevents generation, MLM creates pre-train/fine-tune mismatch ([MASK] seen in training, not inference).

15.2 RoBERTa (Robustly Optimized BERT)

RoBERTa (Liu et al., 2019, Meta): same architecture as BERT-Large, but better training. Remove NSP (harmful), dynamic masking (different masks each epoch), larger batches (8K vs 256), more data (160GB vs 16GB), longer training (500K steps). Result: SOTA on GLUE/SQuAD/RACE without architectural changes.

Lesson: How you train often matters more than architecture. LLaMA-3 gains over LLaMA-2 come from 10× more tokens, not architecture.

Note

Interview Insight: BERT is rarely used for generation (no causal masking). Modern practice: Use decoder-only models (LLaMA, Mistral) even for classification via instruction-tuning. BERT still relevant for embeddings (sentence-transformers) and low-latency classification where you need bidirectional understanding without generation.

16 Decoder-Only Models

16.1 GPT (Generative Pre-trained Transformer)

GPT established the decoder-only paradigm dominating modern LLMs. Unlike BERT’s bidirectional encoding, GPT uses causal attention: token \(i\) only sees positions \(0\) through \(i\). This enables autoregressive generation but sacrifices bidirectional context.
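A toy NumPy sketch of causal attention (single head, no learned projections; shapes are illustrative) makes the masking concrete:

```python
import numpy as np

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # mask out j > i so token i only attends to positions 0..i
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = causal_attention(Q, K, V)
# token 0 can only attend to itself, so its output is exactly V[0]
```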

GPT-1 (2018): 12 layers, 768 dim, 117M params on BooksCorpus. Key insight: unsupervised pre-training transfers to supervised tasks.

GPT-2 (2019): 1.5B parameters. Crucial change: pre-normalization (LayerNorm before attention/FFN) stabilizes deep network training via cleaner gradient flow. Demonstrated zero-shot learning by framing tasks as text completion.

Decoder-only transformer stack:

image

GPT-3 (2020): 175B params (96L, 12,288H, 2048 ctx). Demonstrated in-context learning: few-shot prompting without gradient updates. Training: an estimated $4.6M of compute on 10K V100s, 300B tokens. Architecture similar to GPT-2 but with sparse attention in some layers. Real innovation: scale, from which instruction-following, coding, and arithmetic emerged as capabilities.

GPT-4 (2023): Estimated 1.7T params (MoE, ~280B active per token). Multi-modal (text+images), 128K context. Extensive RLHF for alignment.

16.2 LLaMA Family (Meta)

Meta’s LLaMA democratized LLM research with competitive open-weights models on public data. Its architecture became the open-source standard.

LLaMA-1 (2023): 7B/13B/33B/65B sizes. 7B spec: 32L, 4096H, 32 heads (\(d_k=128\)), 2048 ctx (→4096 via RoPE).

Key innovations:

  • RoPE: \(q_m = R_m q, k_n = R_n k\) → attention depends on \((m-n)\), enables length extrapolation

  • SwiGLU: \(\text{SwiGLU}(x) = \text{Swish}(xW_1) \odot (xW_2)\) → better quality

  • RMSNorm: \(\frac{x}{\sqrt{\text{mean}(x^2) + \epsilon}} \cdot \gamma\) → simpler than LayerNorm

  • No bias terms → better training stability at scale
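RMSNorm and SwiGLU are short enough to sketch directly in NumPy; the weight shapes below are illustrative, not LLaMA's actual dimensions:

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    # normalize by root-mean-square only: no mean subtraction, no bias
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gamma

def swiglu(x, W1, W2):
    # Swish(x W1) gates x W2 elementwise
    a = x @ W1
    return (a / (1.0 + np.exp(-a))) * (x @ W2)

x = np.array([3.0, -4.0, 1.0, 2.0])
normed = rmsnorm(x, gamma=1.0)
# after RMSNorm, the mean of squares is ~1 by construction
```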

Training: 1.4T tokens (CommonCrawl, C4, GitHub, Wikipedia, arXiv, StackExchange). 65B took 21 days on 2048 A100s.

LLaMA-2 (2023): 2T tokens, 4K context, GQA (groups query heads to share KV projections → 4-8× KV cache reduction). Added instruction-tuning + RLHF, competitive with GPT-3.5.

LLaMA-3 (2024): 15T tokens (10× more), 128K vocab (from 32K), 8K ctx → 128K via fine-tuning. All sizes use GQA (8 KV heads). 405B flagship: 126L, 16,384H, 128 Q heads, 8 KV heads. Trained on 16K H100s. Multi-stage RLHF + rejection sampling + DPO.

Example

LLaMA Architecture Summary (7B/8B):

Component LLaMA-1 (7B) LLaMA-3 (8B)
Layers 32 32
Hidden dim 4096 4096
Heads 32 32
KV heads 32 (MHA) 8 (GQA)
Context 2048 8192
Vocab 32K 128K
Tokens trained 1T 15T
FFN dim 11,008 14,336

16.3 Mistral & Mixtral (Mistral AI)

Mistral 7B (Jiang et al., 2023):

Architecture:

  • 32 layers, 4096 hidden, 32 heads, 7.3B params

  • Sliding Window Attention (SWA): Each layer only attends to previous 4096 tokens (window size)

  • GQA: 8 KV heads (4\(\times\) compression)

  • Context: 8K tokens (via SWA), effective receptive field grows with layers

  • RoPE, SwiGLU, RMSNorm (same as LLaMA)

Key Innovation: Sliding window allows longer context with \(O(w)\) memory per layer (vs \(O(n)\) for full attention), where \(w=4096\)
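A sliding-window mask is easy to sketch in NumPy; the window size and sequence length below are toy values, not Mistral's actual \(w=4096\):

```python
import numpy as np

def sliding_window_mask(n, w):
    # True where attention is allowed: causal AND within the last w positions
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

mask = sliding_window_mask(10, 4)
# early rows attend to everything so far; later rows see at most w tokens
```

Stacking such layers grows the effective receptive field: a token can reach information \(L \times w\) positions back through \(L\) layers.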

Mixtral 8x7B (Jiang et al., 2024):

Architecture:

  • Sparse Mixture-of-Experts (MoE)

  • 8 experts per layer (each expert is a full FFN block; attention layers are shared)

  • Router selects top-2 experts per token

  • Total: 46.7B params, only 12.9B active per token

  • 32 layers, 4096 hidden, same attention as Mistral 7B

  • Context: 32K tokens

MoE Benefits:

  • Inference cost of 12.9B model, performance of 46.7B model

  • Outperforms LLaMA-2 70B on most benchmarks

  • Latency comparable to Mistral 7B (the two selected experts run in parallel)
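Top-2 routing can be sketched in NumPy. The logits below are made up; real routers are learned linear layers with load-balancing losses (see attention.tex §8):

```python
import numpy as np

def top2_route(logits):
    # indices of the two highest-scoring experts per token,
    # with softmax weights renormalized over just those two
    idx = np.argsort(logits, axis=-1)[:, -2:]
    sel = np.take_along_axis(logits, idx, axis=-1)
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))
    return idx, w / w.sum(axis=-1, keepdims=True)

logits = np.array([[0.1, 2.0, -1.0, 1.5, 0.0, 0.3, -0.2, 0.9]])
experts, weights = top2_route(logits)
# only experts 1 and 3 run for this token; the other six stay idle
```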

16.4 Qwen (Alibaba Cloud)

Alibaba’s Qwen targets multilingual (esp. Chinese-English-code). Architecture follows LLaMA (RoPE, SwiGLU, RMSNorm, pre-norm) but with massive 151,851-token vocabulary (5× LLaMA’s 32K) to handle Chinese characters + English + code efficiently.

Qwen-1 (2023): 7B uses 32L, 4096H, 32 heads (MHA). 3T tokens with aggressive deduplication/filtering. Multi-stage: pre-train → SFT → RLHF.

Qwen-2 (2024): 29 languages, improved code (more GitHub data), 128K ctx via RoPE scaling.

Qwen-2.5 (2024): 18T tokens (6× more), 0.5B-72B sizes. Specialized: Qwen-2.5-Coder, Qwen-2.5-Math.

16.5 DeepSeek (DeepSeek AI)

DeepSeek-V2 (2024) solves long-context memory via Multi-head Latent Attention (MLA). Problem: 128K context → KV cache dominates memory. Solution: cache a shared low-rank latent instead of full keys and values: \[\begin{align*} C & = W^{\text{down}} X \in \mathbb{R}^{d_c \times n} \quad (d_c \ll d) \\ K & = W_K^{\text{up}} C, \quad V = W_V^{\text{up}} C \end{align*}\] Only \(C\) is cached; with \(d_c=512\) against hidden dim \(d=5120\) → 10× KV cache reduction.

Architecture: 236B total, 21B active/token (MoE: 64 experts, top-6 routing). 60L, 5120H, 128K ctx. Training: 8.1T tokens (Chinese + English + code), $5.5M cost.

Impact: MLA makes long-context practical. LLaMA-70B (8K ctx) needs massive clusters; DeepSeek-V2 (128K ctx, 21B active) runs on modest hardware. Trade-off: complexity + slight quality loss from low-rank projection.
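A toy NumPy sketch of the MLA caching idea, ignoring RoPE decoupling and per-head details; the dimensions are scaled down from the real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_c, n = 512, 64, 100   # toy dims; DeepSeek-V2 uses d=5120, d_c=512
W_down = rng.normal(size=(d_c, d)) / np.sqrt(d)
W_k_up = rng.normal(size=(d, d_c)) / np.sqrt(d_c)
W_v_up = rng.normal(size=(d, d_c)) / np.sqrt(d_c)

X = rng.normal(size=(d, n))
C = W_down @ X        # only this d_c x n latent is cached
K = W_k_up @ C        # keys and values are reconstructed at attention time
V = W_v_up @ C

# caching C instead of both K and V shrinks the cache by 2d / d_c
cache_reduction = (2 * d * n) / (d_c * n)
```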

16.6 Phi Models (Microsoft)

Phi-1 (2023): 1.3B params, trained on high-quality code/reasoning data

Phi-2 (2023): 2.7B params, outperforms 7B models on reasoning benchmarks

Phi-3 (2024):

  • Sizes: 3.8B (mini), 7B, 14B

  • Context: 128K tokens

  • Key idea: Small models with carefully curated data (3.3T tokens)

  • Architecture: Standard decoder-only (similar to LLaMA)

Philosophy: Quality over quantity; smaller models trained on synthetic reasoning data can match larger models

17 Encoder-Decoder Models

17.1 T5 (Text-to-Text Transfer Transformer)

Paper: Raffel et al., 2020 (Google)

Architecture:

  • Encoder-decoder (original Transformer from Vaswani et al.)

  • Sizes: 60M, 220M, 770M, 3B, 11B params

  • 11B spec: 24 encoder layers, 24 decoder layers, 1024 hidden, 16 heads

  • Relative position embeddings (not absolute)

  • SentencePiece tokenization (32K vocab)

Key Innovation: Unified text-to-text framework–all tasks framed as "input text → output text"

  • Translation: "translate English to German: That is good." → "Das ist gut."

  • Classification: "sentiment: This movie is great" → "positive"

  • Summarization: "summarize: [long text]" → "[summary]"

Training:

  • C4 dataset (Colossal Clean Crawled Corpus): 750GB text

  • Pre-training objective: Span corruption (mask consecutive spans, predict them)

  • Multi-task fine-tuning on supervised tasks

Use Cases:

  • Translation, summarization (better than decoder-only for these)

  • Question answering

  • Text classification (via text-to-text)

17.2 BART (Bidirectional and Auto-Regressive Transformers)

Paper: Lewis et al., 2020 (Facebook/Meta)

Architecture:

  • Encoder-decoder (similar to T5)

  • BART-Large: 12 encoder layers, 12 decoder layers, 1024 hidden, 406M params

  • Standard Transformer architecture (absolute position embeddings, GELU)

Pre-training: Denoising autoencoder with various corruption strategies:

  • Token masking (like BERT)

  • Token deletion

  • Text infilling (replace spans with single mask token)

  • Sentence permutation (shuffle sentences)

  • Document rotation (rotate document to start from random token)

Differences from T5:

  • BART uses GELU (original T5 uses ReLU; T5 v1.1 switched to gated activations)

  • BART has more diverse corruption (T5 uses span corruption only)

  • BART trained on smaller data (160GB vs 750GB)

Use Cases:

  • Summarization (CNN/DailyMail state-of-the-art)

  • Text generation with strong understanding (encoder helps)

  • Translation (fine-tuned)

Note

Interview Insight: Encoder-decoder models (T5, BART) excel at sequence-to-sequence tasks where input and output are different (translation, summarization). Decoder-only models (GPT, LLaMA) dominate for generation tasks where output continues/responds to input. Modern trend: Even summarization/translation increasingly done with decoder-only via prompting.

18 Architecture Comparison Tables

18.1 Key Architectural Choices

Model Pos Enc Norm Activation Attn Norm Pos
BERT Learned LayerNorm GELU Full Post
GPT-2 Learned LayerNorm GELU Causal Pre
GPT-3 Learned LayerNorm GELU Causal Pre
T5 Relative LayerNorm ReLU Full Pre
LLaMA RoPE RMSNorm SwiGLU Causal Pre
Mistral RoPE RMSNorm SwiGLU Sliding Pre
Qwen RoPE RMSNorm SwiGLU Causal Pre
DeepSeek-V2 RoPE RMSNorm SwiGLU MLA Pre

18.2 Model Specifications

Model Params Layers Hidden Heads Context
BERT-Base 110M 12 768 12 512
BERT-Large 340M 24 1024 16 512
GPT-2 1.5B 48 1600 25 1024
GPT-3 175B 96 12,288 96 2048
T5-11B 11B 24/24 1024 128 512
LLaMA-7B 7B 32 4096 32 2048
LLaMA-2-70B 70B 80 8192 64 4096
LLaMA-3-8B 8B 32 4096 32 8192
LLaMA-3-70B 70B 80 8192 64 8192
LLaMA-3-405B 405B 126 16,384 128 8192
Mistral-7B 7.3B 32 4096 32 8192
Mixtral-8x7B 46.7B 32 4096 32 32K
Qwen-7B 7B 32 4096 32 8192
Qwen-72B 72B 80 8192 64 32K
DeepSeek-V2 236B 60 5120 128 128K
Phi-3-mini 3.8B 32 3072 32 128K

18.3 Training Data Comparison

Model Tokens Key Datasets
BERT 3.3B words BooksCorpus + Wikipedia
GPT-2 10B tokens WebText (Reddit links)
GPT-3 300B tokens CommonCrawl + Books + Wikipedia
LLaMA-1 1.4T tokens CommonCrawl + C4 + GitHub + arXiv
LLaMA-2 2T tokens Higher quality CC + code
LLaMA-3 15T tokens Curated web + multilingual
Qwen-2.5 18T tokens Multilingual + code + math
DeepSeek-V2 8.1T tokens Chinese + English + code

19 When to Use Which Architecture

19.1 Decision Framework

Classification / Embeddings / NER: BERT-like (RoBERTa, DeBERTa) for bidirectional understanding. Modern alternative: instruction-tuned decoder models (LLaMA, Mistral) for flexibility at cost of speed.

Text generation / Chat: Decoder-only (LLaMA-3, Qwen, Mistral). Long context: DeepSeek-V2 (MLA), LLaMA-3 (128K). Efficiency: Mixtral-8x7B (13B active/47B total), Phi-3 (3.8B, strong reasoning).

Translation / Summarization: Traditional: T5/BART (encoder-decoder for distinct input/output). Modern: decoder-only via prompting (simpler). Encoder-decoder still better for very long documents (bidirectional compression).

Code: Specialized models (DeepSeek-Coder, Qwen-2.5-Coder, StarCoder, CodeLlama) excel at syntax/multi-file context. General models (LLaMA-3, Qwen) work reasonably.

Reasoning / Math: Scale helps: GPT-4, LLaMA-3-405B, DeepSeek-V2. Small: Phi-3 (synthetic reasoning data). Specialized: Qwen-2.5-Math.

Note

Production considerations:

  • Latency requirements: Real-time applications need 7B-13B models; batch processing can use 70B+

  • Cost optimization: MoE models (Mixtral) activate fewer parameters; 4-bit quantization halves memory

  • Long context (100K+ tokens): DeepSeek-V2 with MLA compression or streaming attention architectures

  • Multilingual needs: Qwen (151K vocab) and LLaMA-3 (128K vocab) trained on diverse languages

  • On-premise deployment: Open-weight models (LLaMA, Mistral, Qwen) vs API-only (GPT-4, Claude)

19.2 Computational Budget Considerations

Budget Inference Fine-Tuning
Single GPU (24GB) Phi-3-mini (3.8B) LoRA on 7B models
2-4 GPUs (A100) LLaMA-3-8B, Mistral-7B Full fine-tune 7B
8 GPUs (A100) LLaMA-3-70B, Mixtral-8x7B LoRA on 70B
16+ GPUs LLaMA-3-405B, DeepSeek-V2 Full fine-tune 70B
API only GPT-4, Claude-3.5 Few-shot prompting

19.3 Model Pros & Cons Summary

Model Pros Cons
BERT Best for classification/NER; bidirectional No generation; 512 token limit
GPT-3 Powerful; in-context learning Closed; expensive API; 2048 context
T5/BART Strong at seq2seq tasks Slower than decoder-only; encoder+decoder complexity
LLaMA-2-7B Open weights; efficient; good quality 4K context; weaker than GPT-3.5
LLaMA-3-8B 8K context; 15T tokens; strong Requires more GPU memory than LLaMA-2
LLaMA-3-70B Near GPT-4 quality; open 140GB memory (fp16); slow inference
LLaMA-3-405B SOTA open model 810GB memory; requires many GPUs
Mistral-7B Sliding window; 8K context; fast Smaller training data than LLaMA
Mixtral-8x7B 47B params, 13B active; strong 94GB memory (all experts); complex deployment
Qwen-7B Excellent Chinese; strong code Less English training than LLaMA
Qwen-72B Multilingual; 32K context 144GB memory; fewer users than LLaMA
DeepSeek-V2 MLA = 10\(\times\) KV reduction; 128K 236B total params; complex architecture
Phi-3-mini 3.8B but strong reasoning Narrow training data; less general

20 Practical Considerations

20.1 Context Length Handling

Technique Models Max Context
Sliding Window Mistral Effective \(\infty\) (4K window)
RoPE Scaling LLaMA-2/3 128K (from 4K training)
MLA DeepSeek-V2 128K (low KV cache)
Sparse Attention Longformer, BigBird 4K-16K
Chunking + RAG Any model Arbitrary (retrieve relevant)

20.2 Quantization Options

  • 16-bit (fp16/bf16): Standard training/inference, no quality loss

  • 8-bit (int8): 2\(\times\) memory reduction, minimal quality loss (LLM.int8())

  • 4-bit (NF4): 4\(\times\) reduction, slight quality loss (QLoRA, GPTQ, AWQ)

  • 3-bit or lower: Noticeable degradation, only for extreme resource constraints

Rule of thumb:

  • 7B model: 14GB (fp16), 7GB (8-bit), 3.5GB (4-bit)

  • 70B model: 140GB (fp16), 70GB (8-bit), 35GB (4-bit)
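The rule of thumb is just params × bits/8; a tiny helper makes it explicit (weights only; activations and KV cache add overhead on top):

```python
def model_memory_gb(params_billion, bits):
    # bytes per parameter = bits / 8; 1 GB taken as 1e9 bytes
    # for back-of-envelope sizing (weights only)
    return params_billion * bits / 8
```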

20.3 Fine-Tuning Strategies

  • Full fine-tuning: Update all parameters – best quality, high memory/compute

  • LoRA (Low-Rank Adaptation): Add trainable low-rank matrices \(\Delta W = AB\) where \(A \in \mathbb{R}^{d \times r}\), \(B \in \mathbb{R}^{r \times d}\), \(r \ll d\)

    • Memory: Only store \(2dr\) params (vs \(d^2\) for full)

    • Quality: 90-95% of full fine-tuning performance

    • Common rank: \(r=8\) to \(r=64\)

  • QLoRA: LoRA + 4-bit quantization – fine-tune 70B on single 48GB GPU

  • Prefix tuning: Prepend trainable prefix vectors to the keys/values at each layer (100-1000 virtual tokens)

  • Prompt tuning: Train soft prompts only (even fewer params than prefix)
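A minimal NumPy sketch of the LoRA forward pass, with illustrative dimensions and the standard zero-initialization on \(B\):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=8):
    # frozen base weight W plus scaled low-rank update (alpha/r) * A @ B
    return x @ W + (alpha / r) * (x @ A @ B)

d, r = 256, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))
A = rng.normal(size=(d, r)) * 0.01
B = np.zeros((r, d))    # B starts at zero, so the adapter begins as a no-op
x = rng.normal(size=(1, d))

trainable = 2 * d * r    # 4,096 adapter params vs 65,536 for full tuning
```

Because \(B = 0\) at initialization, training starts exactly at the pretrained model and the low-rank update is learned from there.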

21 Key Innovations Timeline

  • 2017: Transformer (Vaswani et al.) – attention is all you need

  • 2018: BERT (bidirectional pre-training), GPT-1 (generative pre-training)

  • 2019: GPT-2 (1.5B, zero-shot learning), RoBERTa (better BERT training), T5 (text-to-text)

  • 2020: GPT-3 (175B, in-context learning), BART (denoising encoder-decoder)

  • 2021: Codex (code generation), Switch Transformer (1.6T params MoE)

  • 2022: ChatGPT (GPT-3.5 + RLHF), InstructGPT (alignment via human feedback)

  • 2023: GPT-4 (multi-modal), LLaMA-1/2 (open weights), Mistral (sliding window), Claude-2 (100K context)

  • 2024: LLaMA-3 (405B, 15T tokens), Qwen-2.5 (18T tokens), DeepSeek-V2 (MLA), Mixtral-8x22B, GPT-4o (omni-modal)

22 Future Directions

22.1 Open Research Questions

  • How to train 10T+ parameter models efficiently?

  • Can we get GPT-4 quality with 10B params via better data/algorithms?

  • Optimal MoE routing strategies (current routers are simple)

  • How to handle truly long context (10M+ tokens) efficiently?

  • Better alternatives to transformer attention for long sequences?

23 Interview Questions

Note

Q1: Why did decoder-only models replace encoder-decoder for most tasks?

A:

  • Unified architecture: Same model for pre-training and all downstream tasks

  • KV caching: Efficient autoregressive generation (reuse past keys/values)

  • Scaling laws: Decoder-only performance improves more predictably with scale

  • Instruction-tuning: Can handle any task via prompting (no separate heads needed)

Q2: What’s the difference between Multi-Head Attention (MHA) and Grouped Query Attention (GQA)?

A:

  • MHA: Each head has separate K, V projections → \(h\) sets of KV cache

  • GQA: Groups of query heads share KV projections → fewer KV heads (e.g., 32 Q heads, 8 KV heads)

  • Benefit: Reduces KV cache by 4\(\times\) (critical for long context), minimal quality loss

  • Examples: LLaMA-3 (GQA with 8 KV heads), Mistral (GQA)
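The cache saving is simple arithmetic; the sketch below assumes LLaMA-2-70B-like dimensions (80 layers, 128-dim heads, fp16) and counts a single sequence:

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # two tensors (K and V) per layer, fp16, for one sequence
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

mha = kv_cache_gb(80, 64, 128, 8192)   # 70B-class model with full MHA
gqa = kv_cache_gb(80, 8, 128, 8192)    # same model with 8 KV heads (GQA)
# the ratio is exactly 64/8 = 8x; batch serving multiplies both by batch size
```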

Q3: Why is RoPE (Rotary Position Embedding) better than learned positional embeddings?

A:

  • Relative positions: Attention score depends on \((m-n)\) naturally, not absolute positions

  • Extrapolation: Can handle sequences longer than the training length via position interpolation / RoPE scaling

  • No learned params: Deterministic rotation based on position and frequency

  • Used by: All modern LLMs (LLaMA, Qwen, Mistral, DeepSeek)
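The relative-position property can be verified with a single 2-D rotation pair; \(\theta\) below is an arbitrary frequency, not a real RoPE value:

```python
import numpy as np

def rotate(x, pos, theta=0.02):
    # rotate a 2-D pair by angle pos * theta (one RoPE frequency band)
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

q = np.array([1.0, 0.5])
k = np.array([0.3, -0.8])
# the attention score depends only on the offset m - n:
s1 = rotate(q, 10) @ rotate(k, 4)    # positions (10, 4), offset 6
s2 = rotate(q, 17) @ rotate(k, 11)   # both shifted by +7, same offset 6
```

This follows from \(\langle R_m q, R_n k\rangle = q^\top R_{n-m} k\): composing the two rotations leaves only the relative angle.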

Q4: How does Mixture-of-Experts (MoE) improve efficiency?

A:

  • Sparse activation: Only activate top-K experts per token (e.g., Mixtral uses 2 of 8)

  • Total params \(\gg\) active params: 46.7B total, 12.9B active (Mixtral-8x7B)

  • Same latency: Experts run in parallel, no sequential overhead

  • Trade-off: Larger memory footprint (must load all experts), more complex training

Q5: What’s the difference between SwiGLU and GELU activations?

A:

  • GELU: Smooth approximation of ReLU, \(\text{GELU}(x) = x \cdot \Phi(x)\) (Gaussian CDF)

  • SwiGLU: Gated linear unit with Swish, \(\text{SwiGLU}(x) = \text{Swish}(xW_1) \odot (xW_2)\)

  • Why SwiGLU: Better performance empirically (PaLM paper), standard in modern LLMs (LLaMA, etc.)

  • Cost: SwiGLU has 2 projections (higher compute), but worth it for quality

Q6: Why does DeepSeek-V2 use Multi-head Latent Attention (MLA)?

A:

  • Problem: KV cache dominates memory for long contexts (e.g., 128K tokens)

  • Solution: Cache a shared low-rank latent \(C = W^{\text{down}} X\) and reconstruct \(K, V\) via up-projections, where \(d_c \ll d\) (512 vs 5120)

  • Benefit: 10\(\times\) KV cache reduction → can fit 128K context with 21B active params

  • Trade-off: Slight quality loss, more complex implementation

Q7: When would you choose T5/BART over a decoder-only model?

A:

  • Structured tasks: Translation, summarization where input/output are distinct

  • Long inputs, short outputs: Encoder compresses entire input bidirectionally

  • Legacy systems: Already deployed with T5/BART

  • Modern alternative: Decoder-only works well via prompting, simpler to deploy

Q8: How much data is needed to train from scratch?

A (Ballpark Estimates):

  • ResNet-50 (25M params): 1M-10M images (ImageNet has 1.3M). Transfer learning common with fewer.

  • BERT-Base (110M params): 10-100GB text (Wikipedia 16GB + BookCorpus 4GB). Original used BooksCorpus + Wiki.

  • RoBERTa-Base (125M params): 100-160GB text (160GB used in paper). More data than BERT → better performance.

  • GPT-3 (175B params): 300B tokens (Common Crawl filtered, WebText, Books); ~570GB compressed.

  • LLaMA-7B (7B params): 1T tokens (1.4TB text). LLaMA-2 used 2T tokens.

  • LLaMA-70B (70B params): 1T-2T tokens (same data, longer training).

  • General rule: Chinchilla scaling laws suggest ≈20 tokens per parameter (70B model → 1.4T tokens).
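A one-line helper for the Chinchilla rule of thumb (the 20 tokens/param ratio is itself an approximation):

```python
def chinchilla_tokens_t(params_billion, tokens_per_param=20):
    # compute-optimal training tokens, in trillions, under ~20 tokens/param
    return params_billion * tokens_per_param / 1000
```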

Practical Takeaways:

  • Vision models: 1K-10K images per class minimum; transfer learning recommended for \(<\)100K images

  • Small LMs (BERT-size): 10GB-100GB text corpus (can scrape domain-specific data)

  • Large LMs (\(>\)7B): Requires web-scale data (100GB-1TB+); most orgs fine-tune pretrained models

  • Data quality \(>\) quantity: LLaMA outperforms larger models trained on noisier data