Chapter 6: Neural Network Building Blocks

8 Activation Functions

Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. Without activations, deep networks would collapse to linear models.

8.1 Classic Activations

8.1.1 Sigmoid: \(\sigma(x) = \frac{1}{1 + e^{-x}}\)

Properties:

  • Output range: \((0, 1)\)

  • Derivative: \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\)

  • Saturates at both extremes (gradients \(\rightarrow 0\))

Use cases: Binary classification output layer, gate mechanisms (LSTM)

Problems:

  • Vanishing gradients: For \(|x| > 5\), gradient \(\approx 0\)

  • Not zero-centered: outputs always positive, causing zig-zagging gradient updates

  • Expensive computation (exponential)
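The vanishing-gradient problem is easy to see numerically. A minimal NumPy sketch (the helper names are illustrative; the naive form below overflows for very large negative inputs, so production code uses a numerically stable variant):

```python
import numpy as np

def sigmoid(x):
    # Naive form; fine for moderate |x|
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Gradient peaks at 0.25 at x = 0 and is effectively zero once |x| > 5
print(sigmoid_grad(0.0))  # 0.25
print(sigmoid_grad(5.0))  # ~0.0066 — already vanishing
```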

8.1.2 Tanh: \(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)

Properties:

  • Output range: \((-1, 1)\)

  • Derivative: \(\tanh'(x) = 1 - \tanh^2(x)\)

  • Zero-centered (better than sigmoid)

  • Still saturates at extremes

Use cases: RNN/LSTM hidden states, traditionally hidden layers

Note

Interview Insight: Tanh is just a scaled sigmoid: \(\tanh(x) = 2\sigma(2x) - 1\). It’s preferred over sigmoid for hidden layers because zero-centered activations make optimization easier.
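The identity is quick to verify numerically, e.g. with NumPy on a grid of points:

```python
import numpy as np

# Check tanh(x) = 2*sigma(2x) - 1 across a range of inputs
x = np.linspace(-5.0, 5.0, 101)
scaled_sigmoid = 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0
print(np.max(np.abs(np.tanh(x) - scaled_sigmoid)))  # ~0 (floating-point noise)
```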

8.2 Modern Activations

8.2.1 ReLU: \(\text{ReLU}(x) = \max(0, x)\)

Properties:

  • Output range: \([0, \infty)\)

  • Derivative: \(\mathbb{1}_{x > 0}\) (0 if \(x \leq 0\), 1 if \(x > 0\))

  • Does not saturate for positive values

  • Extremely fast to compute

Advantages:

  • Accelerates convergence (AlexNet reported 6× faster training than tanh)

  • Sparse activations (typically ~50% of neurons are zero)

  • Gradient flow: constant gradient of 1 for active neurons

Problems:

  • Dying ReLU: A large gradient update can push the weights so that \(w \cdot x + b < 0\) for every input; the neuron then outputs zero, receives zero gradient, and never recovers.

  • Not zero-centered

8.2.2 Leaky ReLU: \(\text{LeakyReLU}(x) = \max(\alpha x, x)\)

Fixes dying ReLU by allowing small negative slope (\(\alpha \approx 0.01\)): \[\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}\]

Variants:

  • PReLU (Parametric ReLU): Learn \(\alpha\) during training via backprop

  • RReLU (Randomized ReLU): Sample \(\alpha \sim U(l, u)\) during training, use fixed average during inference
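Both activations reduce to one-line NumPy expressions (a sketch; note that \(\max(\alpha x, x)\) matches the piecewise definition whenever \(\alpha < 1\)):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # max(alpha*x, x) equals the piecewise form for alpha < 1
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.  0.  0.  1.5]
print(leaky_relu(x))  # [-0.02  -0.005  0.  1.5]
```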

8.2.3 ELU: \(\text{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\)

Exponential Linear Unit (ELU) smooths negative values:

Advantages:

  • Negative saturation pushes mean activations closer to zero (better than Leaky ReLU)

  • Smooth everywhere (differentiable at 0)

  • Robust to noise

Disadvantages:

  • Expensive exponential computation for \(x < 0\)

8.2.4 GELU: \(\text{GELU}(x) = x \cdot \Phi(x)\)

Gaussian Error Linear Unit used in BERT, GPT, and most modern transformers: \[\text{GELU}(x) = x \cdot P(X \leq x), \quad X \sim \mathcal{N}(0, 1)\]

Approximation: \(\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}(x + 0.044715x^3)\right]\right)\)
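Since \(\Phi(x) = \tfrac{1}{2}(1 + \text{erf}(x/\sqrt{2}))\), the exact form and the tanh approximation can be compared directly (a sketch using the standard library; the tanh version agrees to roughly 3 decimal places):

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi computed via the error function
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh approximation from the text
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

for v in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(v, gelu_exact(v), gelu_tanh(v))
```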

[Figure: GELU activation curve]

Properties:

  • Smooth, non-monotonic

  • Stochastic regularization effect (weights inputs by their magnitude)

  • Empirically better than ReLU for transformers

8.2.5 Swish/SiLU: \(\text{Swish}(x) = x \cdot \sigma(\beta x)\)

Self-Gated activation discovered by Google AutoML: \[\text{Swish}(x) = \frac{x}{1 + e^{-\beta x}}\]

When \(\beta = 1\), called SiLU (Sigmoid Linear Unit). Used in EfficientNet, modern vision models.

[Figure: Swish/SiLU activation curve]

Properties:

  • Smooth, non-monotonic

  • Self-gating: activation modulates itself

  • Approaches linear for large positive \(x\), approaches 0 for large negative \(x\)
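A scalar sketch makes the non-monotonicity concrete: SiLU dips below zero (minimum \(\approx -0.278\) near \(x \approx -1.28\)) before rising:

```python
import math

def silu(x, beta=1.0):
    # Swish with beta = 1 (SiLU): x * sigmoid(x)
    return x / (1.0 + math.exp(-beta * x))

print(silu(-1.28))   # ~ -0.278, the dip below zero
print(silu(10.0))    # ~ 10.0, nearly linear for large positive x
print(silu(-10.0))   # ~ 0.0, vanishes for large negative x
```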

8.2.6 SwiGLU (Gated MLP)

Key Insight: SwiGLU is not just replacing \(\sigma\) with SiLU. It’s a fundamentally different architecture using gating.

Standard FFN (e.g., GELU): \[\text{FFN}(x) = W_2 \cdot \text{GELU}(x W_1 + b_1) + b_2\] One projection \(\rightarrow\) activation \(\rightarrow\) output projection.

SwiGLU (Gated Linear Unit): \[\text{SwiGLU}(x) = (x W_1) \odot \text{Swish}(x W_2)\] Two parallel projections \(\rightarrow\) element-wise gating (one branch gates the other).

In transformers, the FFN becomes: \[\text{FFN}(x) = W_o(\text{SwiGLU}(x))\]

Properties:

  • Gating gives data-dependent feature selection (more expressive than plain GELU)

  • Often improves quality at similar compute (used in PaLM, LLaMA, Mistral)

  • Extra projection costs more; common trick: reduce hidden size to \(2/3\) of GELU FFN to keep params constant

  • Related variants: GEGLU (GELU gate), SiGLU (sigmoid gate)

  • Why \(W_o\)? SwiGLU operates in the expanded hidden size; \(W_o\) projects back to model dimension for the residual add
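The gating structure can be sketched in a few lines of NumPy (dimensions and weight names here are illustrative, not from any specific model; biases are omitted as in LLaMA-style FFNs):

```python
import numpy as np

rng = np.random.default_rng(0)

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W1, W2, Wo):
    # Two parallel projections; the SiLU branch gates the linear branch,
    # then Wo projects from the hidden size back to the model dimension
    return (x @ W1 * silu(x @ W2)) @ Wo

d_model, d_hidden = 8, 16   # toy sizes; real models are far larger
W1 = rng.normal(size=(d_model, d_hidden))
W2 = rng.normal(size=(d_model, d_hidden))
Wo = rng.normal(size=(d_hidden, d_model))

x = rng.normal(size=(4, d_model))   # batch of 4 token vectors
out = swiglu_ffn(x, W1, W2, Wo)
print(out.shape)                    # (4, 8) — back to model dimension
```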

Example

Activation Function Timeline:

  • 1980s-2000s: Sigmoid, Tanh (neural networks, RNNs)

  • 2010-2012: ReLU revolution (AlexNet, ImageNet breakthrough)

  • 2013-2015: Leaky ReLU, PReLU, ELU (fixing dying ReLU)

  • 2016-2018: GELU, Swish (smooth alternatives for transformers)

  • 2019+: SiLU, Mish (vision models, diffusion models)

Note

Interview Question: Why ReLU over Sigmoid?

Three key reasons:

  1. Gradient flow: ReLU has gradient 1 for active neurons (vs sigmoid’s max 0.25)

  2. Sparsity: ~50% of neurons are zero \(\rightarrow\) efficient representations

  3. Compute: No exponentials, just \(\max(0, x)\)

Trade-off: Dying ReLU problem, mitigated by Leaky ReLU or careful initialization.

9 Regularization Techniques

9.1 Dropout

Dropout randomly sets a fraction \(p\) of neuron activations to zero during training, forcing the network to learn redundant representations and preventing co-adaptation of features.

9.1.1 Mathematical Formulation

During training: \[\begin{equation} \tilde{h} = m \odot h, \quad m_i \sim \text{Bernoulli}(1-p) \end{equation}\] where \(h\) is the layer output, \(m\) is a binary mask, and \(p\) is the dropout rate.

During inference: \[\begin{equation} \tilde{h} = (1-p) \cdot h \end{equation}\]

Note

Critical Interview Question: Why scale by \((1-p)\) at test time?

Answer: During training, each neuron is active with probability \((1-p)\). At test time, all neurons are active, so the expected output is \((1-p)\) times larger than during training. We scale down to match training expectations.

Detailed explanation:

  • Training: \(\mathbb{E}[\tilde{h}_i] = (1-p) \cdot h_i\) (neuron active with prob \(1-p\))

  • Test (no dropout): Output is \(h_i\) (always active)

  • To match expectations: multiply by \((1-p)\) at test time

Alternative (Inverted Dropout): Scale up during training by \(\frac{1}{1-p}\), then use outputs as-is at test time: \[\begin{equation} \text{Training: } \tilde{h} = \frac{1}{1-p} (m \odot h), \quad \text{Inference: } \tilde{h} = h \end{equation}\]

Why inverted dropout is preferred: Most modern frameworks (PyTorch, TensorFlow) use inverted dropout to avoid test-time computation. This makes inference faster and removes the need to remember to scale.
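A quick Monte Carlo check confirms the expectation argument: with the \(\frac{1}{1-p}\) scaling, the average over many dropout passes recovers the original activations (a NumPy sketch with a fixed seed):

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.5                          # dropout rate
h = np.array([1.0, 2.0, 3.0])    # pre-dropout activations

# Average many inverted-dropout passes: scaling by 1/(1-p) makes E[h_tilde] = h
trials = np.stack([(rng.random(h.shape) > p) * h / (1 - p) for _ in range(100_000)])
print(trials.mean(axis=0))       # ~ [1. 2. 3.]
```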

9.1.2 Why Dropout Works

  • Ensemble interpretation: Training with dropout samples \(2^n\) different sub-networks (where \(n\) is the number of neurons). At test time, we approximate the ensemble average.

  • Co-adaptation prevention: Neurons cannot rely on specific other neurons being present, forcing distributed representations.

  • Implicit regularization: Acts like \(L^2\) regularization on weights, with strength proportional to dropout rate.

Note

Interview Question: When does dropout hurt performance?

Answer:

  • Small datasets: Reduces effective training data per iteration, can cause underfitting

  • Batch normalization: Dropout + BatchNorm can conflict (BatchNorm already regularizes). Use one or the other, or reduce dropout rate.

  • Recurrent connections: Naive dropout in RNNs disrupts temporal dependencies. Use variational dropout (same mask across timesteps) instead.

  • Very deep networks with residuals: Skip connections already provide regularization; heavy dropout can hurt

  • Modern transformers: Standard dropout on activations is less effective. Use attention dropout (on attention weights) and residual dropout (on residual connections) instead.

Note

Interview Question: How would you implement dropout from scratch?

PyTorch-style implementation:

import torch

def dropout(x, p=0.5, training=True):
    # No-op at inference: inverted dropout bakes the scaling into training
    if not training:
        return x
    # Keep each element with probability 1-p, scale survivors by 1/(1-p)
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1 - p)

Key points:

  • Check training flag (disable at test time)

  • Generate random mask with same shape as input

  • Scale by \(\frac{1}{1-p}\) during training (inverted dropout)

  • No-op during inference

Where to apply dropout:

  • Fully connected layers: \(p = 0.5\) typical for hidden layers

  • After activations in CNNs: \(p = 0.2\) to \(0.5\)

  • Transformers: Attention dropout (\(p = 0.1\)), residual dropout (\(p = 0.1\))

  • NOT in BatchNorm layers (redundant, can hurt)

  • NOT in output layer (want stable predictions)

9.2 Batch Normalization

Normalizes layer inputs across the mini-batch:

Training: \[\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y = \gamma \hat{x} + \beta\]

where \(\mu_B, \sigma_B^2\) are batch mean/variance, \(\gamma, \beta\) are learnable parameters.

Inference: Use running averages of \(\mu, \sigma^2\) computed during training (exponential moving average).

Benefits:

  • Reduces internal covariate shift (the original motivation; later work attributes the benefit more to a smoother optimization landscape)

  • Allows higher learning rates (10-100× in some cases)

  • Provides regularization effect (noise from batch statistics)

  • Reduces sensitivity to initialization

Placement:

  • After linear/conv layer, before activation (original paper)

  • Some architectures use it after activation (ResNet variants)
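The training-time computation is a few lines of NumPy (a sketch of the forward pass only; a real layer also maintains running averages for inference and backpropagates through \(\gamma, \beta\)):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    # x: (batch, features) — statistics are taken over the batch dimension
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # off-center, wide inputs
y = batchnorm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))  # ~ 0 per feature
print(y.std(axis=0))   # ~ 1 per feature
```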

9.3 Layer Normalization

Normalizes across features (not batch): \[\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \quad \mu = \frac{1}{d}\sum_{j=1}^d x_j, \quad \sigma^2 = \frac{1}{d}\sum_{j=1}^d (x_j - \mu)^2\]

Key differences from BatchNorm:

  • Normalizes across features, not batch dimension

  • Same computation at train and test time (no running averages)

  • Better for RNNs, transformers (batch size independent)

  • Works with batch size 1

Use cases:

  • Transformers: Standard (LayerNorm before/after attention and FFN)

  • RNNs/LSTMs: Better than BatchNorm for variable-length sequences

  • Small batch sizes or online learning

9.4 Other Normalization Techniques

  • Instance Normalization: Normalize per sample, per channel (style transfer, GANs)

  • Group Normalization: Divide channels into groups, normalize within groups (works well for small batches)

  • RMSNorm: Simpler variant of LayerNorm, just divide by RMS (used in LLaMA, Mistral): \[\text{RMSNorm}(x) = \frac{x}{\text{RMS}(x)} \cdot \gamma, \quad \text{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2}\]
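RMSNorm's simplicity shows in code: unlike LayerNorm there is no mean subtraction and no \(\beta\) shift (a minimal NumPy sketch of the formula above):

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    # Scale by the root-mean-square over the feature dimension; no centering
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([[3.0, -4.0, 0.0, 0.0]])
y = rmsnorm(x, gamma=np.ones(4))
print(y)  # RMS(x) = 2.5, so y = x / 2.5 = [[1.2 -1.6 0. 0.]]
```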

10 Convolutional Neural Networks (CNNs)

10.1 Convolution Operation

2D Convolution applies learnable filters over spatial dimensions: \[Y[i, j, c_{\text{out}}] = \sum_{c_{\text{in}}=0}^{C_{\text{in}}-1} \sum_{m=0}^{k_h-1}\sum_{n=0}^{k_w-1} W[m, n, c_{\text{in}}, c_{\text{out}}] \cdot X[i+m, j+n, c_{\text{in}}] + b[c_{\text{out}}]\]

where:

  • Input: \(X \in \mathbb{R}^{H \times W \times C_{\text{in}}}\) (e.g., \(224 \times 224 \times 3\) for RGB)

  • Weights: \(W \in \mathbb{R}^{k_h \times k_w \times C_{\text{in}} \times C_{\text{out}}}\)

  • Output: \(Y \in \mathbb{R}^{H' \times W' \times C_{\text{out}}}\)

  • Each output channel sums over all input channels (depth-wise aggregation)

Key parameters:

  • Kernel size: \(k \times k\) (typically 3×3, 5×5, 7×7)

  • Stride: Step size (stride=2 downsamples by 2×)

  • Padding: Add zeros around input (maintains spatial dimensions)

  • Dilation: Spacing between kernel elements (receptive field expansion)

Output size: \[H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} + 2P - K}{S} \right\rfloor + 1\]

where \(P\) = padding, \(K\) = kernel size, \(S\) = stride.
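The formula applied to common configurations (a small helper for one spatial dimension; apply it to height and width separately):

```python
def conv_out(h_in, k, s=1, p=0):
    # floor((H_in + 2P - K) / S) + 1
    return (h_in + 2 * p - k) // s + 1

print(conv_out(224, k=3, s=1, p=1))  # 224 — "same" padding preserves size
print(conv_out(224, k=7, s=2, p=3))  # 112 — ResNet-style stem
print(conv_out(224, k=3, s=2, p=1))  # 112 — strided downsampling
```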

10.2 Why Convolutions?

  1. Parameter sharing: Same filter applied everywhere (translation equivariance)

  2. Sparse connectivity: Each output depends on local patch, not full input

  3. Hierarchy: Early layers detect edges, later layers detect objects

Parameters:

  • Conv layer: \(C_{\text{out}} \times C_{\text{in}} \times k_h \times k_w + C_{\text{out}}\) (weights + biases)

  • Fully connected: \(n_{\text{in}} \times n_{\text{out}} + n_{\text{out}}\)

For \(224 \times 224 \times 3\) image:

  • FC layer to 1000 classes: \(224 \times 224 \times 3 \times 1000 = 150{,}528{,}000 \approx 150M\) parameters

  • Conv 3×3, 64 filters: \(3 \times 3 \times 3 \times 64 = 1,728\) parameters
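The counts above are easy to reproduce (helpers are illustrative; they include bias terms, which the in-text figures omit):

```python
def conv_params(c_in, c_out, k):
    # k*k*C_in weights per output filter, plus one bias per filter
    return c_out * c_in * k * k + c_out

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

print(fc_params(224 * 224 * 3, 1000))  # 150,529,000 — ~150M with biases
print(conv_params(3, 64, 3))           # 1,792 — 1,728 weights + 64 biases
```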

10.3 Pooling Layers

Downsample spatial dimensions:

Max Pooling: \[Y[i, j] = \max_{m, n \in \text{window}} X[i \cdot s + m, j \cdot s + n]\]

Average Pooling: \[Y[i, j] = \frac{1}{k^2}\sum_{m, n \in \text{window}} X[i \cdot s + m, j \cdot s + n]\]

Global Average Pooling (GAP): Average over entire spatial dimensions: \(H \times W \times C \rightarrow 1 \times 1 \times C\)

Used in modern architectures (ResNet, Inception) instead of fully connected layers.

10.4 Common CNN Architectures

  • LeNet-5 (1998): Conv-Pool-Conv-Pool-FC (MNIST)

  • AlexNet (2012): Deeper, ReLU, Dropout, Data augmentation (ImageNet winner)

  • VGG (2014): Stacked 3×3 convs, very deep (16-19 layers)

  • ResNet (2015): Skip connections, 50-152 layers

  • Inception/GoogLeNet (2014): Multi-scale convolutions in parallel

  • EfficientNet (2019): Compound scaling (depth, width, resolution)

Note

Interview Question: Why 3×3 convolutions?

Two 3×3 convolutions have same receptive field as one 5×5, but:

  • Fewer parameters: \(2 \times (3^2 \times C^2) = 18C^2\) vs \(25C^2\)

  • More non-linearity: 2 ReLUs vs 1

  • Deeper network for same compute

VGG popularized this; now standard in ResNet, modern CNNs.

11 Recurrent Neural Networks (RNNs)

11.1 Vanilla RNN

Process sequences by maintaining hidden state: \[h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)\] \[y_t = W_{hy} h_t + b_y\]

Unrolled computation: \[h_t = f(h_{t-1}, x_t; \theta), \quad h_0 = 0\]

Problems:

  1. Vanishing gradients: Gradients decay exponentially with sequence length \[\frac{\partial h_t}{\partial h_0} = \prod_{k=1}^t \frac{\partial h_k}{\partial h_{k-1}} = \prod_{k=1}^t \text{diag}(\tanh'(\cdot)) \cdot W_{hh}\] If \(\|W_{hh}\| < 1\), gradients vanish; if \(> 1\), they can explode.

  2. Exploding gradients: Fixed with gradient clipping

11.2 Long Short-Term Memory (LSTM)

LSTM solves vanishing gradients with gating mechanisms and cell state:

Gates: \[\begin{align} f_t & = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) & & \text{(Forget gate)} \\ i_t & = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) & & \text{(Input gate)} \\ o_t & = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) & & \text{(Output gate)} \\ \tilde{C}_t & = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) & & \text{(Candidate cell state)} \end{align}\]

Cell state update: \[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]

Hidden state: \[h_t = o_t \odot \tanh(C_t)\]

Key insight: Cell state \(C_t\) has additive path (not multiplicative), allowing gradients to flow without vanishing: \[\frac{\partial C_t}{\partial C_{t-1}} = f_t \quad \text{(element-wise, not matrix multiplication)}\]

Gate functions:

  • Forget gate \(f_t\): What to remove from cell state (0 = forget all, 1 = keep all)

  • Input gate \(i_t\): What new information to add

  • Output gate \(o_t\): What to output from cell state
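The gate equations translate directly into a single recurrence step (a NumPy sketch; following common practice, the four gate weight matrices are packed into one matrix \(W\) acting on \([h_{t-1}, x_t]\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W, b):
    # W: (4*H, din + H) packs forget / input / output / candidate weights
    z = W @ np.concatenate([h_prev, x]) + b
    H = h_prev.size
    f = sigmoid(z[0:H])            # forget gate
    i = sigmoid(z[H:2*H])          # input gate
    o = sigmoid(z[2*H:3*H])        # output gate
    C_tilde = np.tanh(z[3*H:])     # candidate cell state
    C = f * C_prev + i * C_tilde   # additive cell-state update
    h = o * np.tanh(C)
    return h, C

rng = np.random.default_rng(0)
din, H = 3, 4                      # toy dimensions
W = rng.normal(scale=0.1, size=(4 * H, din + H))
b = np.zeros(4 * H)
h, C = lstm_step(rng.normal(size=din), np.zeros(H), np.zeros(H), W, b)
print(h.shape, C.shape)            # (4,) (4,)
```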

Example

LSTM Example: Remembering Long-Term Dependencies

Sentence: “The cat, which we found last week in the park, was hungry.”

  • Input gate: Activates for “cat” (subject)

  • Cell state: Remembers “cat” through long phrase

  • Forget gate: Stays near 1 through the intervening clause, so “cat” is not erased from the cell state

  • Output gate: Opens when predicting “was” to recall singular subject

Result: Correct agreement “was” (singular) not “were” (plural).

11.3 Gated Recurrent Unit (GRU)

GRU simplifies LSTM by merging cell and hidden state:

Gates: \[\begin{align} r_t & = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r) & & \text{(Reset gate)} \\ z_t & = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z) & & \text{(Update gate)} \end{align}\]

Candidate hidden state: \[\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t] + b)\]

Hidden state update: \[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\]

Comparison to LSTM:

  • Fewer parameters: 2 gates vs 3, no separate cell state

  • Faster to train (fewer matrix multiplications)

  • Performance often comparable to LSTM on many tasks

  • Update gate \(z_t\) controls forget and input simultaneously
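For comparison with the LSTM, a GRU step is shorter: no cell state, and the update gate \(z_t\) interpolates between old and candidate hidden states (a NumPy sketch with illustrative weight names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wr, Wz, Wh, br, bz, bh):
    xh = np.concatenate([h_prev, x])
    r = sigmoid(Wr @ xh + br)      # reset gate
    z = sigmoid(Wz @ xh + bz)      # update gate
    # Candidate state sees a reset-scaled copy of the previous state
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]) + bh)
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(1)
din, H = 3, 4                      # toy dimensions
Wr, Wz, Wh = (rng.normal(scale=0.1, size=(H, din + H)) for _ in range(3))
h = gru_step(rng.normal(size=din), np.zeros(H), Wr, Wz, Wh,
             np.zeros(H), np.zeros(H), np.zeros(H))
print(h.shape)                     # (4,)
```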

Note

Interview Question: LSTM vs GRU?

LSTM:

  • More expressive (separate forget/input gates)

  • Better for complex, long-term dependencies

  • Standard in NLP before transformers

GRU:

  • Fewer parameters (faster training, less overfitting)

  • Often matches LSTM performance

  • Easier to tune

Rule of thumb: Try GRU first (faster), switch to LSTM if GRU underperforms.

11.4 Bidirectional RNNs

Process sequence in both directions: \[\overrightarrow{h}_t = f(\overrightarrow{h}_{t-1}, x_t), \quad \overleftarrow{h}_t = f(\overleftarrow{h}_{t+1}, x_t)\] \[h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]\]

Use cases:

  • Sentence classification, NER, POS tagging (full context available)

  • NOT for language modeling (future tokens unavailable)

11.5 RNN Variants and Modern Usage

Historical progression:

  1. 1990s-2000s: Vanilla RNN (limited by vanishing gradients)

  2. 1997-2014: LSTM introduced (1997), popularized by Graves (2005, 2013)

  3. 2014: GRU proposed (Cho et al.)

  4. 2014-2017: Peak RNN era (seq2seq, attention, NMT)

  5. 2017+: Transformers largely replace RNNs for NLP

  6. 2020s: RNNs still used in time series, audio, some vision tasks

Where RNNs are still used:

  • Time series forecasting (stock prices, sensor data)

  • Audio processing (speech recognition, music generation)

  • On-device inference (transformers too large)

  • Online learning (streaming data)

Why transformers replaced RNNs:

  • Parallelization: Transformers process all tokens simultaneously; RNNs sequential

  • Long-range dependencies: Attention has constant path length; RNNs have linear path

  • Gradient flow: Attention has direct connections; RNNs suffer vanishing gradients

12 Practical Considerations

12.1 Choosing Activation Functions

Activation function selection guide:

Use Case                        Activation         Reason
Hidden layers (general)         ReLU, Leaky ReLU   Fast, good gradients
Transformers, LLMs              GELU, SiLU         Smooth, empirically better
Output (binary classification)  Sigmoid            Probability output
Output (multi-class)            Softmax            Probability distribution
RNN/LSTM gates                  Sigmoid            Gating (0-1 range)
RNN/LSTM state                  Tanh               Zero-centered
Deep CNNs (ResNet)              ReLU               Simple, effective
Vision models (EfficientNet)    Swish/SiLU         Better accuracy

12.2 Regularization Strategy

Modern deep learning stack:

  1. Data augmentation: First line of defense (rotation, crop, color jitter)

  2. Normalization: BatchNorm (CNNs), LayerNorm (transformers)

  3. Dropout: After FC layers (0.5), light in convs (0.2)

  4. Weight decay: L2 regularization via optimizer (AdamW: 0.01-0.1)

  5. Early stopping: Monitor validation loss

Common mistakes:

  • Using dropout with BatchNorm (redundant, can hurt performance)

  • Too much dropout in transformers (use attention dropout instead)

  • Forgetting to set model.eval() (BatchNorm, Dropout behave differently)

12.3 Architecture Selection

Architecture selection by task:

Task                    Architecture              Notes
Image classification    ResNet, EfficientNet      ResNet50 baseline; EfficientNet for efficiency
Object detection        Faster R-CNN, YOLO        Faster R-CNN for accuracy; YOLO for speed
Semantic segmentation   U-Net, DeepLab            U-Net for medical; DeepLab for general
Text classification     BERT, RoBERTa             Fine-tune BERT; RoBERTa if more data
Text generation         GPT, LLaMA                GPT-style decoder-only
Seq2Seq (translation)   Transformer (enc-dec)     mT5, mBART for multilingual
Time series             LSTM, GRU, Temporal CNN   LSTM for long-term; TCN a recent alternative
Speech recognition      Conformer, Whisper        Conformer hybrid; Whisper pretrained
Note

Interview Wisdom: Start Simple

  1. Start with simplest model that could work (logistic regression, small CNN/RNN)

  2. Overfit single batch to verify implementation

  3. Add regularization only after overfitting on full train set

  4. Use pretrained models when available (transfer learning)

  5. Architecture search last resort (expensive, often not needed)

Most improvements come from data quality, feature engineering, and hyperparameter tuning, not fancy architectures.

13 Summary: Building Block Decision Tree

13.1 Quick Reference

1. Activation Function

  • Hidden layers: ReLU (default), GELU (transformers)

  • Output layer: Sigmoid (binary), Softmax (multi-class), Linear (regression)

2. Normalization

  • CNNs: BatchNorm after conv, before activation

  • Transformers/RNNs: LayerNorm

  • Small batches: GroupNorm or LayerNorm

3. Regularization

  • Always: Data augmentation, weight decay

  • CNNs: Dropout (0.5) on FC layers

  • Transformers: Attention dropout (0.1), residual dropout

  • Early stopping on validation set

4. Layer Type

  • Spatial data (images): Convolutions

  • Sequential data (modern): Transformers

  • Sequential data (lightweight): LSTM/GRU

  • Tabular data: Fully connected (+ embeddings for categoricals)

5. Common Pitfalls

  • Forgetting model.eval() at test time

  • Using dropout with BatchNorm (pick one or use carefully)

  • Wrong activation in output (sigmoid for binary, not softmax)

  • Not clipping gradients in RNNs (exploding gradients)

  • Insufficient warmup for transformers with LayerNorm