Chapter 6: Neural Network Building Blocks
8 Activation Functions
Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. Without activations, deep networks would collapse to linear models.
8.1 Classic Activations
8.1.1 Sigmoid: \(\sigma(x) = \frac{1}{1 + e^{-x}}\)
Properties:
Output range: \((0, 1)\)
Derivative: \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\)
Saturates at both extremes (gradients \(\rightarrow 0\))
Use cases: Binary classification output layer, gate mechanisms (LSTM)
Problems:
Vanishing gradients: For \(|x| > 5\), gradient \(\approx 0\)
Not zero-centered: outputs always positive, causing zig-zagging gradient updates
Expensive computation (exponential)
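As a quick sanity check, the function and its derivative can be sketched in a few lines of Python (a minimal standard-library sketch, not tied to any framework):

```python
import math

def sigmoid(x):
    # Numerically stable form: avoid overflow of exp(-x) for large |x|
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Gradient peaks at 0.25 (x = 0) and saturates at the extremes
print(sigmoid_grad(0.0))   # 0.25
print(sigmoid_grad(6.0))   # ~0.0025, nearly vanished
```

The gradient at 6 already being below 0.01 illustrates why stacking many sigmoid layers starves early layers of signal.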
8.1.2 Tanh: \(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)
Properties:
Output range: \((-1, 1)\)
Derivative: \(\tanh'(x) = 1 - \tanh^2(x)\)
Zero-centered (better than sigmoid)
Still saturates at extremes
Use cases: RNN/LSTM hidden states, traditionally hidden layers
Interview Insight: Tanh is just a scaled sigmoid: \(\tanh(x) = 2\sigma(2x) - 1\). It’s preferred over sigmoid for hidden layers because zero-centered activations make optimization easier.
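The identity is easy to verify numerically (a minimal sketch):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# tanh is a shifted, rescaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1
for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
    assert abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12
```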
8.2 Modern Activations
8.2.1 ReLU: \(\text{ReLU}(x) = \max(0, x)\)
Properties:
Output range: \([0, \infty)\)
Derivative: \(\mathbb{1}_{x > 0}\) (0 if \(x \leq 0\), 1 if \(x > 0\))
Does not saturate for positive values
Extremely fast to compute
Advantages:
Accelerates convergence (AlexNet reported roughly 6× faster training than tanh)
Sparse activations (typically ~50% of neurons output zero)
Gradient flow: constant gradient of 1 for active neurons
Problems:
Dying ReLU: A large gradient update can push a neuron's weights so that \(w \cdot x + b < 0\) for all inputs. The neuron then outputs zero everywhere, receives zero gradient, and never recovers.
Not zero-centered
8.2.2 Leaky ReLU: \(\text{LeakyReLU}(x) = \max(\alpha x, x)\)
Fixes dying ReLU by allowing small negative slope (\(\alpha \approx 0.01\)): \[\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}\]
Variants:
PReLU (Parametric ReLU): Learn \(\alpha\) during training via backprop
RReLU (Randomized ReLU): Sample \(\alpha \sim U(l, u)\) during training, use fixed average during inference
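The whole family can be expressed with one NumPy helper (a sketch; PReLU would additionally learn alpha via backprop rather than fixing it):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha = 0 recovers plain ReLU; alpha > 0 gives the "leak"
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x, alpha=0.0))   # ReLU: negatives clipped to 0
print(leaky_relu(x))              # Leaky: small negative slope survives
```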
8.2.3 ELU: \(\text{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\)
Exponential Linear Unit (ELU) smooths negative values:
Advantages:
Negative saturation pushes mean activations closer to zero (better than Leaky ReLU)
Smooth everywhere (differentiable at 0)
Robust to noise
Disadvantages:
Expensive exponential computation for \(x < 0\)
8.2.4 GELU: \(\text{GELU}(x) = x \cdot \Phi(x)\)
Gaussian Error Linear Unit used in BERT, GPT, and most modern transformers: \[\text{GELU}(x) = x \cdot P(X \leq x), \quad X \sim \mathcal{N}(0, 1)\]
Approximation: \(\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}(x + 0.044715x^3)\right]\right)\)
Properties:
Smooth, non-monotonic
Stochastic regularization effect (weights inputs by their magnitude)
Empirically better than ReLU for transformers
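The exact form (via the error function) and the tanh approximation can be compared directly; a minimal standard-library sketch:

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh approximation quoted above, used in many implementations
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

# The approximation tracks the exact form to within ~1e-3 on typical ranges
for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    assert abs(gelu_exact(x) - gelu_tanh(x)) < 1e-3
```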
8.2.5 Swish/SiLU: \(\text{Swish}(x) = x \cdot \sigma(\beta x)\)
Self-Gated activation discovered by Google AutoML: \[\text{Swish}(x) = \frac{x}{1 + e^{-\beta x}}\]
When \(\beta = 1\), called SiLU (Sigmoid Linear Unit). Used in EfficientNet, modern vision models.
Properties:
Smooth, non-monotonic
Self-gating: activation modulates itself
Approaches linear for large positive \(x\), approaches 0 for large negative \(x\)
8.2.6 SwiGLU (Gated MLP)
Key Insight: SwiGLU is not just replacing \(\sigma\) with SiLU. It’s a fundamentally different architecture using gating.
Standard FFN (e.g., GELU): \[\text{FFN}(x) = W_2 \cdot \text{GELU}(x W_1 + b_1) + b_2\] One projection \(\rightarrow\) activation \(\rightarrow\) output projection.
SwiGLU (Gated Linear Unit): \[\text{SwiGLU}(x) = (x W_1) \odot \text{Swish}(x W_2)\] Two parallel projections \(\rightarrow\) element-wise gating (one branch gates the other).
In transformers, the FFN becomes: \[\text{FFN}(x) = W_o(\text{SwiGLU}(x))\]
Properties:
Gating gives data-dependent feature selection (more expressive than plain GELU)
Often improves quality at similar compute (used in PaLM, LLaMA, Mistral)
Extra projection costs more; common trick: reduce hidden size to \(2/3\) of GELU FFN to keep params constant
Related variants: GEGLU (GELU gate), the original GLU (sigmoid gate)
Why \(W_o\)? SwiGLU operates in the expanded hidden size; \(W_o\) projects back to model dimension for the residual add
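A minimal NumPy sketch of the gated FFN described above (the sizes d_model and d_hidden are illustrative, and biases are omitted, as in LLaMA-style FFNs):

```python
import numpy as np

def silu(x):
    # SiLU / Swish with beta = 1
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W1, W2, Wo):
    # Two parallel projections; the SiLU branch gates the linear branch,
    # then Wo projects back to the model dimension
    return (silu(x @ W2) * (x @ W1)) @ Wo

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16   # illustrative sizes, not from the text
x = rng.normal(size=(4, d_model))
W1 = rng.normal(size=(d_model, d_hidden))
W2 = rng.normal(size=(d_model, d_hidden))
Wo = rng.normal(size=(d_hidden, d_model))
print(swiglu_ffn(x, W1, W2, Wo).shape)  # (4, 8)
```

Note the two input projections (W1, W2) versus the single projection of a standard GELU FFN, which is where the extra parameter cost comes from.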
Activation Function Timeline:
1980s-2000s: Sigmoid, Tanh (neural networks, RNNs)
2010-2012: ReLU revolution (AlexNet, ImageNet breakthrough)
2013-2015: Leaky ReLU, PReLU, ELU (fixing dying ReLU)
2016-2018: GELU, Swish (smooth alternatives for transformers)
2019+: SiLU, Mish (vision models, diffusion models)
Interview Question: Why ReLU over Sigmoid?
Three key reasons:
Gradient flow: ReLU has gradient 1 for active neurons (vs sigmoid’s max 0.25)
Sparsity: ~50% of neurons zero \(\rightarrow\) efficient representations
Compute: No exponentials, just \(\max(0, x)\)
Trade-off: Dying ReLU problem, mitigated by Leaky ReLU or careful initialization.
9 Regularization Techniques
9.1 Dropout
Dropout randomly sets a fraction \(p\) of neuron activations to zero during training, forcing the network to learn redundant representations and preventing co-adaptation of features.
9.1.1 Mathematical Formulation
During training: \[\begin{equation} \tilde{h} = m \odot h, \quad m_i \sim \text{Bernoulli}(1-p) \end{equation}\] where \(h\) is the layer output, \(m\) is a binary mask, and \(p\) is the dropout rate.
During inference: \[\begin{equation} \tilde{h} = (1-p) \cdot h \end{equation}\]
Critical Interview Question: Why scale by \((1-p)\) at test time?
Answer: During training, each neuron is active with probability \((1-p)\). At test time, all neurons are active, so the expected output is \((1-p)\) times larger than during training. We scale down to match training expectations.
Detailed explanation:
Training: \(\mathbb{E}[\tilde{h}_i] = (1-p) \cdot h_i\) (neuron active with prob \(1-p\))
Test (no dropout): Output is \(h_i\) (always active)
To match expectations: multiply by \((1-p)\) at test time
Alternative (Inverted Dropout): Scale up during training by \(\frac{1}{1-p}\), then use outputs as-is at test time: \[\begin{equation} \text{Training: } \tilde{h} = \frac{1}{1-p} (m \odot h), \quad \text{Inference: } \tilde{h} = h \end{equation}\]
Why inverted dropout is preferred: Most modern frameworks (PyTorch, TensorFlow) use inverted dropout to avoid test-time computation. This makes inference faster and removes the need to remember to scale.
9.1.2 Why Dropout Works
Ensemble interpretation: Training with dropout samples \(2^n\) different sub-networks (where \(n\) is the number of neurons). At test time, we approximate the ensemble average.
Co-adaptation prevention: Neurons cannot rely on specific other neurons being present, forcing distributed representations.
Implicit regularization: Acts like \(L^2\) regularization on weights, with strength proportional to dropout rate.
Interview Question: When does dropout hurt performance?
Answer:
Small datasets: Reduces effective training data per iteration, can cause underfitting
Batch normalization: Dropout + BatchNorm can conflict (BatchNorm already regularizes). Use one or the other, or reduce dropout rate.
Recurrent connections: Naive dropout in RNNs disrupts temporal dependencies. Use variational dropout (same mask across timesteps) instead.
Very deep networks with residuals: Skip connections already provide regularization; heavy dropout can hurt
Modern transformers: Standard dropout on activations is less effective. Use attention dropout (on attention weights) and residual dropout (on residual connections) instead.
Interview Question: How would you implement dropout from scratch?
PyTorch-style implementation:
import torch

def dropout(x, p=0.5, training=True):
    if not training:
        return x
    # Inverted dropout: drop with probability p, scale survivors by 1/(1-p)
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1 - p)
Key points:
Check training flag (disable at test time)
Generate random mask with same shape as input
Scale by \(\frac{1}{1-p}\) during training (inverted dropout)
No-op during inference
Where to apply dropout:
Fully connected layers: \(p = 0.5\) typical for hidden layers
After activations in CNNs: \(p = 0.2\) to \(0.5\)
Transformers: Attention dropout (\(p = 0.1\)), residual dropout (\(p = 0.1\))
NOT in BatchNorm layers (redundant, can hurt)
NOT in output layer (want stable predictions)
9.2 Batch Normalization
Normalizes layer inputs across the mini-batch:
Training: \[\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y = \gamma \hat{x} + \beta\]
where \(\mu_B, \sigma_B^2\) are batch mean/variance, \(\gamma, \beta\) are learnable parameters.
Inference: Use running averages of \(\mu, \sigma^2\) computed during training (exponential moving average).
Benefits:
Reduces internal covariate shift (the paper's original motivation; later work attributes the benefit more to smoothing the optimization landscape)
Allows higher learning rates (10-100× in some cases)
Provides regularization effect (noise from batch statistics)
Reduces sensitivity to initialization
Placement:
After linear/conv layer, before activation (original paper)
Some architectures use it after activation (ResNet variants)
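The train/inference split, including the running-average update, can be sketched in NumPy (a minimal version for fully connected inputs; the momentum value is illustrative):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.1, eps=1e-5):
    # x: (batch, features). Training uses batch statistics and updates
    # the running averages in place; inference uses the stored averages.
    if training:
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        running_mean *= (1 - momentum)
        running_mean += momentum * mu
        running_var *= (1 - momentum)
        running_var += momentum * var
    else:
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
gamma, beta = np.ones(4), np.zeros(4)
rm, rv = np.zeros(4), np.ones(4)
y = batchnorm_forward(x, gamma, beta, rm, rv, training=True)
print(y.mean(axis=0))  # ~0 per feature
print(y.var(axis=0))   # ~1 per feature
```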
9.3 Layer Normalization
Normalizes across features (not batch): \[\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad \mu = \frac{1}{d}\sum_{j=1}^d x_j, \quad \sigma^2 = \frac{1}{d}\sum_{j=1}^d (x_j - \mu)^2\]
Key differences from BatchNorm:
Normalizes across features, not batch dimension
Same computation at train and test time (no running averages)
Better for RNNs, transformers (batch size independent)
Works with batch size 1
Use cases:
Transformers: Standard (LayerNorm before/after attention and FFN)
RNNs/LSTMs: Better than BatchNorm for variable-length sequences
Small batch sizes or online learning
9.4 Other Normalization Techniques
Instance Normalization: Normalize per sample, per channel (style transfer, GANs)
Group Normalization: Divide channels into groups, normalize within groups (works well for small batches)
RMSNorm: Simpler variant of LayerNorm, just divide by RMS (used in LLaMA, Mistral): \[\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma, \quad \text{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2}\]
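RMSNorm is short enough to sketch directly (a minimal NumPy version; eps is an assumed small constant for numerical stability):

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    # Normalize by the root-mean-square over the feature dimension;
    # unlike LayerNorm there is no mean subtraction and no beta shift
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma

x = np.array([[3.0, -4.0]])          # RMS = sqrt((9 + 16) / 2) = sqrt(12.5)
y = rmsnorm(x, gamma=np.ones(2))
print(y)  # ~[[0.8485, -1.1314]]
```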
10 Convolutional Neural Networks (CNNs)
10.1 Convolution Operation
2D Convolution applies learnable filters over spatial dimensions: \[Y[i, j, c_{\text{out}}] = \sum_{c_{\text{in}}=0}^{C_{\text{in}}-1} \sum_{m=0}^{k_h-1}\sum_{n=0}^{k_w-1} W[m, n, c_{\text{in}}, c_{\text{out}}] \cdot X[i+m, j+n, c_{\text{in}}] + b[c_{\text{out}}]\]
where:
Input: \(X \in \mathbb{R}^{H \times W \times C_{\text{in}}}\) (e.g., \(224 \times 224 \times 3\) for RGB)
Weights: \(W \in \mathbb{R}^{k_h \times k_w \times C_{\text{in}} \times C_{\text{out}}}\)
Output: \(Y \in \mathbb{R}^{H' \times W' \times C_{\text{out}}}\)
Each output channel sums over all input channels (depth-wise aggregation)
Key parameters:
Kernel size: \(k \times k\) (typically 3×3, 5×5, 7×7)
Stride: Step size (stride=2 downsamples by 2×)
Padding: Add zeros around input (maintains spatial dimensions)
Dilation: Spacing between kernel elements (receptive field expansion)
Output size: \[H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} + 2P - K}{S} \right\rfloor + 1\]
where \(P\) = padding, \(K\) = kernel size, \(S\) = stride.
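The formula can be wrapped in a small helper (a sketch) to check common configurations:

```python
def conv_output_size(h_in, k, stride=1, padding=0):
    # floor((H + 2P - K) / S) + 1
    return (h_in + 2 * padding - k) // stride + 1

# 224x224 input, 3x3 kernel, stride 1, padding 1 -> spatial size preserved
print(conv_output_size(224, k=3, stride=1, padding=1))  # 224
# Stride 2 with the same padding halves the resolution
print(conv_output_size(224, k=3, stride=2, padding=1))  # 112
```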
10.2 Why Convolutions?
Parameter sharing: Same filter applied everywhere (translation equivariance)
Sparse connectivity: Each output depends on local patch, not full input
Hierarchy: Early layers detect edges, later layers detect objects
Parameters:
Conv layer: \(C_{\text{out}} \times C_{\text{in}} \times k_h \times k_w + C_{\text{out}}\) (weights + biases)
Fully connected: \(n_{\text{in}} \times n_{\text{out}} + n_{\text{out}}\)
For \(224 \times 224 \times 3\) image:
FC layer to 1000 classes: \(224 \times 224 \times 3 \times 1000 \approx 150\)M parameters
Conv 3×3, 64 filters: \(3 \times 3 \times 3 \times 64 = 1{,}728\) weight parameters (plus 64 biases)
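The two counting formulas can be checked with a couple of one-liners (a sketch; the conv figure in the text counts weights, with biases shown separately):

```python
def conv_params(c_in, c_out, k):
    # C_out * C_in * k_h * k_w weights, plus C_out biases
    return c_out * c_in * k * k + c_out

def fc_params(n_in, n_out):
    # n_in * n_out weights, plus n_out biases
    return n_in * n_out + n_out

# First 3x3, 64-filter conv on an RGB image
print(conv_params(3, 64, 3))           # 1792 (1728 weights + 64 biases)
# FC from a flattened 224x224x3 image to 1000 classes
print(fc_params(224 * 224 * 3, 1000))  # 150529000, i.e. ~150.5M
```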
10.3 Pooling Layers
Downsample spatial dimensions:
Max Pooling: \[Y[i, j] = \max_{m, n \in \text{window}} X[i \cdot s + m, j \cdot s + n]\]
Average Pooling: \[Y[i, j] = \frac{1}{k^2}\sum_{m, n \in \text{window}} X[i \cdot s + m, j \cdot s + n]\]
Global Average Pooling (GAP): Average over entire spatial dimensions: \(H \times W \times C \rightarrow 1 \times 1 \times C\)
Used in modern architectures (ResNet, Inception) instead of fully connected layers.
10.4 Common CNN Architectures
LeNet-5 (1998): Conv-Pool-Conv-Pool-FC (MNIST)
AlexNet (2012): Deeper, ReLU, Dropout, Data augmentation (ImageNet winner)
VGG (2014): Stacked 3×3 convs, very deep (16-19 layers)
ResNet (2015): Skip connections, 50-152 layers
Inception/GoogLeNet (2014): Multi-scale convolutions in parallel
EfficientNet (2019): Compound scaling (depth, width, resolution)
Interview Question: Why 3×3 convolutions?
Two 3×3 convolutions have same receptive field as one 5×5, but:
Fewer parameters: \(2 \times (3^2 \times C^2) = 18C^2\) vs \(25C^2\)
More non-linearity: 2 ReLUs vs 1
Deeper network for same compute
VGG popularized this; now standard in ResNet, modern CNNs.
11 Recurrent Neural Networks (RNNs)
11.1 Vanilla RNN
Process sequences by maintaining hidden state: \[h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)\] \[y_t = W_{hy} h_t + b_y\]
Unrolled computation: \[h_t = f(h_{t-1}, x_t; \theta), \quad h_0 = 0\]
Problems:
Vanishing gradients: Gradients decay exponentially with sequence length \[\frac{\partial h_t}{\partial h_0} = \prod_{k=1}^t \frac{\partial h_k}{\partial h_{k-1}} = \prod_{k=1}^t W_{hh} \cdot \text{diag}(\tanh'(\cdot))\] If \(\|W_{hh}\| < 1\), gradients vanish; if \(> 1\), explode.
Exploding gradients: Fixed with gradient clipping
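The effect can be demonstrated by repeatedly multiplying a gradient vector by a fixed matrix, mimicking backprop through time (a toy sketch using scaled identity matrices; a real \(W_{hh}\) is dense, but the same norm argument applies):

```python
import numpy as np

# Backprop through t steps multiplies by W_hh (times tanh' <= 1) each step,
# so the gradient norm scales roughly like ||W_hh||^t
g = np.ones(16)
W_small = 0.9 * np.eye(16)   # spectral norm < 1 -> vanishing
W_large = 1.1 * np.eye(16)   # spectral norm > 1 -> exploding
gv, ge = g.copy(), g.copy()
for _ in range(50):
    gv = W_small @ gv
    ge = W_large @ ge
print(np.linalg.norm(gv))  # ~0.9^50 * 4 ≈ 0.02
print(np.linalg.norm(ge))  # ~1.1^50 * 4 ≈ 470
```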
11.2 Long Short-Term Memory (LSTM)
LSTM solves vanishing gradients with gating mechanisms and cell state:
Gates: \[\begin{align} f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) & & \text{(Forget gate)} \\ i_t & = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) & & \text{(Input gate)} \\ o_t & = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) & & \text{(Output gate)} \\ \tilde{C}_t & = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) & & \text{(Candidate cell state)} \end{align}\]
Cell state update: \[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]
Hidden state: \[h_t = o_t \odot \tanh(C_t)\]
Key insight: Cell state \(C_t\) has additive path (not multiplicative), allowing gradients to flow without vanishing: \[\frac{\partial C_t}{\partial C_{t-1}} = f_t \quad \text{(element-wise, not matrix multiplication)}\]
Gate functions:
Forget gate \(f_t\): What to remove from cell state (0 = forget all, 1 = keep all)
Input gate \(i_t\): What new information to add
Output gate \(o_t\): What to output from cell state
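The gate equations above can be sketched as a single NumPy step function (the four weight matrices are stacked into one, applied to the concatenated \([h_{t-1}, x_t]\); sizes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W: (4*hidden, hidden+input) -- forget, input, output, candidate rows
    hid = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[:hid])              # forget gate
    i = sigmoid(z[hid:2 * hid])       # input gate
    o = sigmoid(z[2 * hid:3 * hid])   # output gate
    c_tilde = np.tanh(z[3 * hid:])    # candidate cell state
    c_t = f * c_prev + i * c_tilde    # additive cell-state update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8   # illustrative sizes
W = rng.normal(scale=0.1, size=(4 * d_hid, d_hid + d_in))
b = np.zeros(4 * d_hid)
h, c = np.zeros(d_hid), np.zeros(d_hid)
for t in range(5):   # run a short random sequence
    h, c = lstm_step(rng.normal(size=d_in), h, c, W, b)
print(h.shape, c.shape)  # (8,) (8,)
```

The additive `c_t = f * c_prev + i * c_tilde` line is the key: gradients flow through element-wise multiplication by `f`, not through a repeated matrix product.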
LSTM Example: Remembering Long-Term Dependencies
Sentence: “The cat, which we found last week in the park, was hungry.”
Input gate: Activates for “cat” (subject)
Cell state: Remembers “cat” through long phrase
Forget gate: Stays near 1 through the intervening phrase, preserving “cat” in the cell state
Output gate: Opens when predicting “was” to recall singular subject
Result: Correct agreement “was” (singular) not “were” (plural).
11.3 Gated Recurrent Unit (GRU)
GRU simplifies LSTM by merging cell and hidden state:
Gates: \[\begin{align} r_t & = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r) & & \text{(Reset gate)} \\ z_t & = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z) & & \text{(Update gate)} \end{align}\]
Candidate hidden state: \[\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t] + b)\]
Hidden state update: \[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\]
Comparison to LSTM:
Fewer parameters: 2 gates vs 3, no separate cell state
Faster to train (fewer matrix multiplications)
Performance often comparable to LSTM on many tasks
Update gate \(z_t\) controls forget and input simultaneously
Interview Question: LSTM vs GRU?
LSTM:
More expressive (separate forget/input gates)
Better for complex, long-term dependencies
Standard in NLP before transformers
GRU:
Fewer parameters (faster training, less overfitting)
Often matches LSTM performance
Easier to tune
Rule of thumb: Try GRU first (faster), switch to LSTM if GRU underperforms.
11.4 Bidirectional RNNs
Process sequence in both directions: \[\overrightarrow{h}_t = f(\overrightarrow{h}_{t-1}, x_t), \quad \overleftarrow{h}_t = f(\overleftarrow{h}_{t+1}, x_t)\] \[h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]\]
Use cases:
Sentence classification, NER, POS tagging (full context available)
NOT for language modeling (future tokens unavailable)
11.5 RNN Variants and Modern Usage
Historical progression:
1990s-2000s: Vanilla RNN (limited by vanishing gradients)
1997-2014: LSTM introduced (1997), popularized by Graves (2005, 2013)
2014: GRU proposed (Cho et al.)
2014-2017: Peak RNN era (seq2seq, attention, NMT)
2017+: Transformers largely replace RNNs for NLP
2020s: RNNs still used in time series, audio, some vision tasks
Where RNNs are still used:
Time series forecasting (stock prices, sensor data)
Audio processing (speech recognition, music generation)
On-device inference (transformers too large)
Online learning (streaming data)
Why transformers replaced RNNs:
Parallelization: Transformers process all tokens simultaneously; RNNs sequential
Long-range dependencies: Attention has constant path length; RNNs have linear path
Gradient flow: Attention has direct connections; RNNs suffer vanishing gradients
12 Practical Considerations
12.1 Choosing Activation Functions
| Use Case | Activation | Reason |
|---|---|---|
| Hidden layers (general) | ReLU, Leaky ReLU | Fast, good gradients |
| Transformers, LLMs | GELU, SiLU | Smooth, empirically better |
| Output (binary classification) | Sigmoid | Probability output |
| Output (multi-class) | Softmax | Probability distribution |
| RNN/LSTM gates | Sigmoid | Gating (0-1 range) |
| RNN/LSTM state | Tanh | Zero-centered |
| Deep CNNs (ResNet) | ReLU | Simple, effective |
| Vision models (EfficientNet) | Swish/SiLU | Better accuracy |
12.2 Regularization Strategy
Modern deep learning stack:
Data augmentation: First line of defense (rotation, crop, color jitter)
Normalization: BatchNorm (CNNs), LayerNorm (transformers)
Dropout: After FC layers (0.5), light in convs (0.2)
Weight decay: L2 regularization via optimizer (AdamW: 0.01-0.1)
Early stopping: Monitor validation loss
Common mistakes:
Using dropout with BatchNorm (redundant, can hurt performance)
Too much dropout in transformers (use attention dropout instead)
Forgetting to set model.eval() (BatchNorm, Dropout behave differently at inference)
12.3 Architecture Selection
| Task | Architecture | Notes |
|---|---|---|
| Image classification | ResNet, EfficientNet | ResNet50 baseline, EfficientNet for efficiency |
| Object detection | Faster R-CNN, YOLO | Faster R-CNN accuracy, YOLO speed |
| Semantic segmentation | U-Net, DeepLab | U-Net medical, DeepLab general |
| Text classification | BERT, RoBERTa | BERT fine-tune, RoBERTa if more data |
| Text generation | GPT, LLaMA | GPT-style decoder-only |
| Seq2Seq (translation) | Transformer (enc-dec) | mT5, mBART for multilingual |
| Time series | LSTM, GRU, Temporal CNN | LSTM long-term, TCN recent alternative |
| Speech recognition | Conformer, Whisper | Conformer hybrid, Whisper pretrained |
Interview Wisdom: Start Simple
Start with simplest model that could work (logistic regression, small CNN/RNN)
Overfit single batch to verify implementation
Add regularization only after overfitting on full train set
Use pretrained models when available (transfer learning)
Architecture search last resort (expensive, often not needed)
Most improvements come from data quality, feature engineering, and hyperparameter tuning, not fancy architectures.
13 Summary: Building Block Decision Tree
13.1 Quick Reference
1. Activation Function
Hidden layers: ReLU (default), GELU (transformers)
Output layer: Sigmoid (binary), Softmax (multi-class), Linear (regression)
2. Normalization
CNNs: BatchNorm after conv, before activation
Transformers/RNNs: LayerNorm
Small batches: GroupNorm or LayerNorm
3. Regularization
Always: Data augmentation, weight decay
CNNs: Dropout (0.5) on FC layers
Transformers: Attention dropout (0.1), residual dropout
Early stopping on validation set
4. Layer Type
Spatial data (images): Convolutions
Sequential data (modern): Transformers
Sequential data (lightweight): LSTM/GRU
Tabular data: Fully connected (+ embeddings for categoricals)
5. Common Pitfalls
Forgetting model.eval() at test time
Using dropout with BatchNorm (pick one or use carefully)
Wrong activation in output (sigmoid for binary, not softmax)
Not clipping gradients in RNNs (exploding gradients)
Insufficient warmup for transformers with LayerNorm