Chapter 1: Machine Learning Foundations

3 Statistical Learning Framework

3.1 The Learning Problem

Given dataset \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n\) drawn i.i.d. from unknown distribution \(\mathcal{P}(X, Y)\), learn function \(f: \mathcal{X} \to \mathcal{Y}\) that minimizes expected risk: \[R(f) = \mathbb{E}(\ell(f(X), Y)) = \int \ell(f(x), y) \, dP(x,y)\]

Problem: \(\mathcal{P}\) is unknown. Solution: Empirical Risk Minimization (ERM): \[\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)\]

Find \(\hat{f} = \operatorname*{arg\,min}_{f \in \mathcal{F}} \hat{R}_n(f)\) where \(\mathcal{F}\) is the hypothesis class.

3.2 Bias-Variance Decomposition

For regression with squared loss \(\ell(y, \hat{y}) = (y - \hat{y})^2\), the expected test error decomposes as: \[\mathbb{E}((Y - \hat{f}(X))^2) = \underbrace{\mathbb{E}((f^*(X) - \mathbb{E}(\hat{f}(X) \mid X))^2)}_{\text{Bias}^2} + \underbrace{\mathbb{E}(\text{Var}(\hat{f}(X) \mid X))}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}}\]

where \(f^*(X) = \mathbb{E}(Y \mid X)\) is the Bayes optimal predictor, outer expectation is over test point \(X\), and inner expectation/variance is over training sets.

Bias-Variance Trade-off:

  • High capacity (complex \(\mathcal{F}\)): Low bias, high variance \(\to\) overfitting

  • Low capacity (simple \(\mathcal{F}\)): High bias, low variance \(\to\) underfitting

  • Regularization: Constrains \(\mathcal{F}\) to balance bias and variance

Note

Interview Insight: Bias-variance trade-off explains why more parameters don’t always improve test performance. Regularization (L2, L1, dropout) reduces variance at cost of slight bias increase.

4 Linear Algebra Essentials

4.1 Matrix Decompositions

4.1.1 Singular Value Decomposition (SVD)

Any matrix \(A \in \mathbb{R}^{m \times n}\) can be factored as: \[A = U \Sigma V^T\]

where \(U \in \mathbb{R}^{m \times m}\) (left singular vectors), \(\Sigma \in \mathbb{R}^{m \times n}\) (diagonal matrix of singular values \(\sigma_1 \ge \sigma_2 \ge \cdots \ge 0\)), \(V \in \mathbb{R}^{n \times n}\) (right singular vectors).

Properties:

  • \(U\) and \(V\) are orthogonal: \(U^TU = I\), \(V^TV = I\)

  • Columns of \(U\) are eigenvectors of \(AA^T\) (span the column space)

  • Columns of \(V\) are eigenvectors of \(A^TA\) (span the row space)

  • Singular values \(\sigma_i = \sqrt{\lambda_i(A^TA)} = \sqrt{\lambda_i(AA^T)}\)

  • Rank: Number of non-zero singular values

  • Moore-Penrose pseudoinverse: \(A^+ = V\Sigma^+ U^T\) where \(\Sigma^+\) inverts non-zero entries

Applications:

  • Low-rank approximation: Best rank-\(k\) approximation is \(A_k = \sum_{i=1}^k \sigma_i u_i v_i^T\) (Eckart-Young theorem)

  • PCA (for data matrix \(X \in \mathbb{R}^{n \times d}\)): Principal component directions are columns of \(V \in \mathbb{R}^{d \times d}\), projected data is \(XV = U\Sigma\), explained variance is \(\sigma_i^2/n\)

  • Least squares: \(\hat{x} = A^+ b\) minimizes \(\|Ax - b\|^2\) even when \(A^TA\) is singular

  • Matrix norms: \(\|A\|_2 = \sigma_1\) (spectral norm), \(\|A\|_F = \sqrt{\sum \sigma_i^2}\) (Frobenius norm)
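The properties and applications above are easy to verify numerically. A minimal NumPy sketch (the random test matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A)               # full SVD: A = U diag(s) V^T

# Norm identities expressed via the singular values
assert np.isclose(np.linalg.norm(A, 2), s[0])                       # spectral
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2)))  # Frobenius

# Pseudoinverse A+ = V Sigma+ U^T matches np.linalg.pinv
S_plus = np.zeros((4, 6))
S_plus[:4, :4] = np.diag(1 / s)
A_plus = Vt.T @ S_plus @ U.T
assert np.allclose(A_plus, np.linalg.pinv(A))

# Eckart-Young: the rank-k truncation's spectral error is sigma_{k+1}
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
```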

Example

SVD for Image Compression

For image matrix \(A \in \mathbb{R}^{m \times n}\), keeping top \(k\) singular values: \[A_k = \sum_{i=1}^k \sigma_i u_i v_i^T\]

Storage: \(k(m+n+1)\) numbers vs \(mn\) (full). For \(m=n=1000\), \(k=50\): \(50 \times 2001 \approx 10^5\) vs \(10^6\), i.e. about 90% compression with minimal visual loss.

4.1.2 Eigendecomposition

For square matrix \(A \in \mathbb{R}^{n \times n}\): \[A = Q \Lambda Q^{-1}\]

where \(Q\) contains eigenvectors (columns), \(\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)\) contains eigenvalues.

Special case (symmetric \(A = A^T\)): \[A = Q \Lambda Q^T \quad \text{(orthogonal diagonalization)}\]

\(Q\) is orthogonal, eigenvectors are orthonormal.

Properties:

  • \(\text{tr}(A) = \sum \lambda_i\) (sum of eigenvalues)

  • \(\det(A) = \prod \lambda_i\) (product of eigenvalues)

  • \(A\) is positive definite iff all \(\lambda_i > 0\) (e.g., covariance matrices)

  • \(A\) is positive semidefinite iff all \(\lambda_i \ge 0\)

  • Powers: \(A^k = Q \Lambda^k Q^{-1}\) (efficient for large \(k\))

Applications:

  • PCA: For data \(X \in \mathbb{R}^{n \times d}\), covariance \(\Sigma = \frac{1}{n}X^TX = V\Lambda V^T\) where \(V\) contains principal component directions

  • Spectral clustering: Eigenvectors of graph Laplacian

  • Stability analysis: Eigenvalues of Hessian determine local minima/maxima
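A short NumPy check of the symmetric case and the trace/determinant/power properties (the test matrix is an arbitrary symmetric positive definite example):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
A = B @ B.T + 5 * np.eye(5)        # symmetric PD by construction

lam, Q = np.linalg.eigh(A)          # eigh for symmetric matrices: Q orthogonal

assert np.allclose(Q @ np.diag(lam) @ Q.T, A)     # A = Q Lambda Q^T
assert np.isclose(np.trace(A), lam.sum())         # trace = sum of eigenvalues
assert np.isclose(np.linalg.det(A), lam.prod())   # det = product of eigenvalues
assert np.all(lam > 0)                            # PD iff all eigenvalues > 0

# Matrix power via the decomposition: A^3 = Q Lambda^3 Q^T
assert np.allclose(Q @ np.diag(lam**3) @ Q.T, np.linalg.matrix_power(A, 3))
```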

4.1.3 QR Decomposition

Any matrix \(A \in \mathbb{R}^{m \times n}\) (with \(m \ge n\)) can be factored as: \[A = QR\]

where \(Q \in \mathbb{R}^{m \times n}\) has orthonormal columns (\(Q^TQ = I\)), \(R \in \mathbb{R}^{n \times n}\) is upper triangular.

Algorithms:

  • Gram-Schmidt: Orthogonalize columns of \(A\) sequentially (numerically unstable)

  • Householder reflections: Use reflections \(H_i = I - 2vv^T\) to zero below diagonal (stable, \(O(mn^2)\))

  • Givens rotations: Use 2D rotations to zero individual elements (sparse matrices)

Applications:

  • Least squares: Solve \(Ax = b\) via \(Rx = Q^Tb\) (back-substitution on \(R\))

  • Eigenvalue algorithms: QR iteration for finding eigenvalues

  • Numerical stability: More stable than normal equations \((A^TA)^{-1}A^Tb\) for ill-conditioned \(A\)
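The least-squares application can be sketched in a few lines of NumPy (random data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 3))
b = rng.normal(size=50)

Q, R = np.linalg.qr(A)                 # reduced QR: Q is 50x3, R is 3x3
x_qr = np.linalg.solve(R, Q.T @ b)     # solve R x = Q^T b (R is triangular)

# Agrees with the normal-equations solution on this well-conditioned problem
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
assert np.allclose(x_qr, x_ne)
```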

4.1.4 Cholesky Decomposition

For symmetric positive definite \(A = A^T\) with \(\lambda_i > 0\): \[A = LL^T\]

where \(L\) is lower triangular with positive diagonal entries.

Properties:

  • Unique decomposition (if diagonal of \(L\) is positive)

  • Efficient: about \(n^3/3\) flops vs \(2n^3/3\) for LU (roughly half the cost)

  • Numerically stable for positive definite matrices

Applications:

  • Covariance matrices: Solve linear systems in \(\Sigma\) (or compute \(\Sigma^{-1}\)) efficiently via \(L\)

  • Gaussian sampling: Sample \(z \sim \mathcal{N}(0, \Sigma)\) via \(z = L\epsilon\) where \(\epsilon \sim \mathcal{N}(0,I)\)

  • Optimization: Hessian inversion in Newton’s method
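The Gaussian-sampling application in code (a minimal NumPy sketch; the covariance matrix is an arbitrary PD example):

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])         # target covariance (positive definite)

L = np.linalg.cholesky(Sigma)          # Sigma = L L^T, L lower triangular
assert np.allclose(L @ L.T, Sigma)

# z = L eps with eps ~ N(0, I) has covariance L I L^T = Sigma
eps = rng.normal(size=(2, 100_000))
z = L @ eps
emp_cov = np.cov(z)
assert np.allclose(emp_cov, Sigma, atol=0.05)
```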

4.1.5 LU Decomposition

For square matrix \(A \in \mathbb{R}^{n \times n}\): \[PA = LU\]

where \(P\) is a permutation matrix (for numerical stability), \(L\) is lower triangular, \(U\) is upper triangular.

Applications:

  • Solve \(Ax = b\) via forward/back substitution: \(Ly = Pb\), then \(Ux = y\)

  • Compute determinant: \(\det(A) = \det(P) \prod u_{ii}\)

  • Multiple right-hand sides: Reuse \(LU\) for different \(b\)

4.2 Key Concepts

4.2.1 Matrix Norms

Vector Norms:

  • \(\|x\|_1 = \sum_i |x_i|\) (L1 norm, sparse solutions)

  • \(\|x\|_2 = \sqrt{\sum_i x_i^2}\) (L2 norm, Euclidean distance)

  • \(\|x\|_\infty = \max_i |x_i|\) (infinity norm, worst-case)

Matrix Norms:

  • Frobenius norm: \(\|A\|_F = \sqrt{\sum_{ij} a_{ij}^2} = \sqrt{\text{tr}(A^TA)} = \sqrt{\sum_i \sigma_i^2}\)

  • Spectral norm (operator norm): \(\|A\|_2 = \sigma_{\max}(A)\) (largest singular value)

  • Nuclear norm: \(\|A\|_* = \sum_i \sigma_i\) (sum of singular values, convex relaxation of rank)

4.2.2 Condition Number

\[\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)} = \|A\|_2 \|A^{-1}\|_2\]

Interpretation:

  • Measures sensitivity of \(Ax = b\) solution to perturbations in \(A\) or \(b\)

  • \(\kappa(A) \approx 1\): Well-conditioned (stable)

  • \(\kappa(A) \gg 1\): Ill-conditioned (numerically unstable, small changes in input cause large changes in output)

  • Ridge regression adds \(\lambda I\) to reduce condition number: \(\kappa(A^TA + \lambda I) < \kappa(A^TA)\)
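A quick numerical illustration of the ridge claim, using two nearly collinear features (all values illustrative):

```python
import numpy as np

def cond(M):
    """Condition number as ratio of extreme singular values."""
    s = np.linalg.svd(M, compute_uv=False)
    return s[0] / s[-1]

rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
# Second column is a tiny perturbation of the first -> ill-conditioned Gram matrix
X = np.column_stack([x1, x1 + 1e-4 * rng.normal(size=100)])

gram = X.T @ X
lam = 1.0
assert cond(gram + lam * np.eye(2)) < cond(gram)   # ridge improves conditioning
```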

4.2.3 Rank and Subspaces

Four Fundamental Subspaces:

  • Column space \(\text{col}(A) \subseteq \mathbb{R}^m\): Span of columns, dimension = rank\((A)\)

  • Row space \(\text{row}(A) \subseteq \mathbb{R}^n\): Span of rows, dimension = rank\((A)\)

  • Null space \(\text{null}(A) \subseteq \mathbb{R}^n\): \(\{x : Ax = 0\}\), dimension = \(n - \text{rank}(A)\)

  • Left null space \(\text{null}(A^T) \subseteq \mathbb{R}^m\): \(\{y : A^Ty = 0\}\), dimension = \(m - \text{rank}(A)\)

Rank-Nullity Theorem: \[\text{rank}(A) + \text{dim}(\text{null}(A)) = n\]

4.2.4 Positive Definite Matrices

\(A\) is positive definite (PD) if \(x^TAx > 0\) for all \(x \ne 0\). (\(A\) is assumed symmetric, but if not, consider the symmetric part \((A + A^T)/2\).)

Equivalent Conditions:

  • All eigenvalues \(\lambda_i > 0\)

  • All leading principal minors (determinants of top-left submatrices) are positive

  • \(A = B^TB\) for some invertible \(B\)

  • Cholesky decomposition exists: \(A = LL^T\)

Positive Semidefinite (PSD): \(x^TAx \ge 0\) (allows \(\lambda_i = 0\)). Covariance matrices are always PSD.

4.3 Matrix Calculus

Gradient (scalar function w.r.t. vector): \[\nabla_x f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \in \mathbb{R}^n\]

Jacobian (vector function w.r.t. vector): \[J_{ij} = \frac{\partial f_i}{\partial x_j}, \quad J \in \mathbb{R}^{m \times n}\]

Hessian (second derivatives): \[H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}, \quad H \in \mathbb{R}^{n \times n} \text{ (symmetric)}\]

Common Identities:

  • \(\nabla_x (a^Tx) = a\)

  • \(\nabla_x (x^TAx) = (A + A^T)x\) (if \(A\) symmetric: \(2Ax\))

  • \(\nabla_X \text{tr}(AX) = A^T\)

  • \(\nabla_X \text{tr}(X^TAX) = (A + A^T)X\) (if \(A\) symmetric: \(2AX\))

  • \(\nabla_X \log\det(X) = X^{-T}\) (for invertible \(X\))

4.4 ML Applications

Example

OLS via Different Decompositions

Solve \(\min_\theta \|X\theta - y\|^2\):

1. Normal Equations: \(\hat{\theta} = (X^TX)^{-1}X^Ty\)

  • Fast (\(O(nd^2)\)), but numerically unstable if \(\kappa(X^TX)\) large

2. QR Decomposition: \(X = QR \Rightarrow \hat{\theta} = R^{-1}Q^Ty\)

  • More stable (\(O(nd^2)\)), no squaring of condition number

3. SVD: \(X = U\Sigma V^T \Rightarrow \hat{\theta} = V\Sigma^+ U^Ty\)

  • Most stable, handles rank-deficient \(X\), reveals multicollinearity

  • Slower (\(O(nd^2 + d^3)\)), but essential for ill-conditioned problems
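The three solvers above can be compared directly; on a well-conditioned problem they agree to machine precision (a NumPy sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=80)

# 1. Normal equations
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# 2. QR decomposition
Q, R = np.linalg.qr(X)
theta_qr = np.linalg.solve(R, Q.T @ y)

# 3. SVD pseudoinverse: theta = V Sigma+ U^T y
U, s, Vt = np.linalg.svd(X, full_matrices=False)
theta_svd = Vt.T @ ((U.T @ y) / s)

assert np.allclose(theta_ne, theta_qr)
assert np.allclose(theta_qr, theta_svd)
```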

Example

PCA: SVD vs Eigendecomposition

Given centered data \(X \in \mathbb{R}^{n \times d}\) (n samples, d features):

Convention: Rows are samples, columns are features. Covariance is \(\frac{1}{n}X^TX \in \mathbb{R}^{d \times d}\) (feature covariance). If columns were samples (\(X \in \mathbb{R}^{d \times n}\)), use \(\frac{1}{n}XX^T\) instead. Covariance matrix is always \(d \times d\) (feature space).

Method 1 (Covariance eigendecomposition):

  1. Compute covariance \(\Sigma = \frac{1}{n}X^TX \in \mathbb{R}^{d \times d}\)

  2. Eigendecomposition: \(\Sigma = V\Lambda V^T\)

  3. Principal components: columns of \(V\) (eigenvectors in \(\mathbb{R}^d\))

  4. Explained variance: diagonal of \(\Lambda\)

Method 2 (SVD of data):

  1. Compute \(X = U\Sigma_{\text{diag}} V^T\) where \(U \in \mathbb{R}^{n \times n}\), \(V \in \mathbb{R}^{d \times d}\)

  2. Principal components: columns of \(V\) (same as Method 1)

  3. Projected data: \(XV = U\Sigma_{\text{diag}} \in \mathbb{R}^{n \times d}\)

  4. Explained variance: \(\sigma_i^2 / n\)

Connection: \(\frac{1}{n}X^TX = \frac{1}{n}V\Sigma_{\text{diag}}^2 V^T\), so columns of \(V\) are eigenvectors of covariance with eigenvalues \(\lambda_i = \sigma_i^2/n\).

Why SVD? More numerically stable, avoids forming \(X^TX\) (which squares condition number).
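The equivalence of the two methods is easy to confirm numerically (a sketch; eigenvectors match only up to sign, so the check compares absolute inner products):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))
X = X - X.mean(axis=0)                      # center the data
n = X.shape[0]

# Method 1: eigendecomposition of the covariance
lam, V_eig = np.linalg.eigh(X.T @ X / n)
order = np.argsort(lam)[::-1]               # eigh returns ascending order
lam, V_eig = lam[order], V_eig[:, order]

# Method 2: SVD of the data matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Same eigenvalues: lambda_i = sigma_i^2 / n
assert np.allclose(lam, s**2 / n)
# Same directions, up to sign
for i in range(3):
    assert np.isclose(abs(V_eig[:, i] @ Vt[i]), 1.0)
```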

Note

Interview Questions:

Q: When would you use QR vs SVD for least squares?

A: QR (Householder) is faster and sufficient for well-conditioned full-rank problems. For rank-deficient or ill-conditioned cases, use QR with column pivoting (RRQR) or SVD; SVD is most robust and gives singular values, rank, null space, and pseudoinverse.

Q: Why is ridge regression more stable than OLS?

A: Adding \(\lambda I\) to \(X^TX\) ensures all eigenvalues \(\ge \lambda\), reducing condition number: \(\kappa(X^TX + \lambda I) \le \frac{\sigma_{\max}^2 + \lambda}{\lambda}\). This prevents numerical instability from near-zero eigenvalues.

5 Linear Models & Regularization

5.1 Jointly Gaussian Variables

For random vectors \((X, Y)\) following a multivariate Gaussian distribution: \[\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{bmatrix}\right)\]

Key Result (Conditional Distribution): The conditional distribution of \(Y\) given \(X = x\) is Gaussian: \[Y \mid X = x \sim \mathcal{N}(\mu_{Y|X}, \Sigma_{Y|X})\] where: \[\begin{align*} \mu_{Y|X} & = \mathbb{E}(Y \mid X = x) = \mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(x - \mu_X) \\ \Sigma_{Y|X} & = \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY} \end{align*}\]

Simplified Case (Zero Mean): If \(\mu_X = \mu_Y = 0\): \[\mathbb{E}(Y \mid X) = \Sigma_{YX}\Sigma_{XX}^{-1}X = \beta^* X\] where \(\beta^* = \Sigma_{YX}\Sigma_{XX}^{-1}\) is the population regression coefficient.

Connection to Linear Regression:

  • Best linear predictor: For jointly Gaussian \((X, Y)\), the conditional expectation \(\mathbb{E}(Y \mid X)\) is exactly linear in \(X\)

  • OLS estimates this: Sample estimate \(\hat{\beta}_{\text{OLS}} = (X^\top X)^{-1}X^\top y\) converges to \(\beta^* = \Sigma_{YX}\Sigma_{XX}^{-1}\) as \(n \to \infty\)

  • Why OLS works: Even if \((X, Y)\) not Gaussian, \(\mathbb{E}(Y \mid X)\) may still be approximately linear, and OLS finds the best linear approximation

  • Non-Gaussian case: If \((X, Y)\) not jointly Gaussian, \(\mathbb{E}(Y \mid X)\) may be nonlinear, but \(\beta^* X\) is still the best linear predictor (minimizes \(\mathbb{E}((Y - \beta X)^2)\))

Note

Interview Insight: The formula \(\mathbb{E}(Y \mid X) = \Sigma_{YX}\Sigma_{XX}^{-1}X\) explains why linear regression works for Gaussian data: it’s the exact conditional expectation. For non-Gaussian data, OLS still finds the best linear approximation, but may miss important nonlinear relationships (which neural networks can capture).

5.2 Ordinary Least Squares (OLS)

Model: \(y = X\theta + \varepsilon\) where \(\varepsilon \sim \mathcal{N}(0, \sigma^2 I)\), \(X \in \mathbb{R}^{n \times d}\), \(\theta \in \mathbb{R}^d\).

Objective: Minimize squared loss: \[L(\theta) = \|X\theta - y\|^2 = (X\theta - y)^\top(X\theta - y)\]

Solution: Set \(\nabla_\theta L = 2X^\top(X\theta - y) = 0\) \(\to\) Normal Equations: \[X^\top X \theta = X^\top y \quad \Rightarrow \quad \hat{\theta}_{\text{OLS}} = (X^\top X)^{-1} X^\top y\]

Geometric Interpretation: \(\hat{y} = X\hat{\theta}\) is the orthogonal projection of \(y\) onto \(\text{col}(X)\).

Statistical Interpretation: Under Gaussian noise, OLS is the Maximum Likelihood Estimator (MLE): \[\hat{\theta}_{\text{MLE}} = \operatorname*{arg\,max}_\theta \, p(y \mid X, \theta) = \operatorname*{arg\,max}_\theta \exp\left(-\frac{1}{2\sigma^2}\|X\theta - y\|^2\right)\]

5.3 Ridge Regression (L2 Regularization / Shrinkage)

Objective: Add penalty for large weights: \[L(\theta) = \|X\theta - y\|^2 + \lambda \|\theta\|^2\]

Closed-Form Solution: \[\hat{\theta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y\]

Key Properties:

  • \(\lambda I\) ensures \(X^\top X + \lambda I\) is invertible even if \(X^\top X\) is singular

  • Bayesian view: Ridge is MAP estimate with Gaussian prior \(\theta \sim \mathcal{N}(0, \frac{1}{\lambda}I)\)

  • Shrinks all coefficients toward zero but never sets them exactly to zero

  • Works well when features are correlated (multicollinearity)

MAP Reconciliation (Gaussian): If \(y \mid \theta \sim \mathcal{N}(X\theta, \sigma^2 I)\) and \(\theta \sim \mathcal{N}(0, \Sigma_0)\), then \[\mu_{\theta \mid y} = \Sigma_0 X^\top (X \Sigma_0 X^\top + \sigma^2 I)^{-1} y = (X^\top X + \sigma^2 \Sigma_0^{-1})^{-1} X^\top y\] Setting \(\Sigma_0 = \tau^2 I\) gives ridge with \(\lambda = \sigma^2/\tau^2\); taking \(\tau^2 \to \infty\) recovers OLS.
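The two forms of the Gaussian posterior mean, and their agreement with the ridge solution at \(\lambda = \sigma^2/\tau^2\), can be verified numerically (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)

sigma2, tau2 = 0.5, 2.0
lam = sigma2 / tau2                    # ridge penalty implied by the prior
Sigma0 = tau2 * np.eye(3)

# Form 1: data-space expression (40x40 solve)
mu1 = Sigma0 @ X.T @ np.linalg.solve(X @ Sigma0 @ X.T + sigma2 * np.eye(40), y)
# Form 2: parameter-space expression (3x3 solve)
mu2 = np.linalg.solve(X.T @ X + sigma2 * np.linalg.inv(Sigma0), X.T @ y)
# Ridge closed form with lambda = sigma^2 / tau^2
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

assert np.allclose(mu1, mu2)
assert np.allclose(mu2, theta_ridge)
```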

5.4 Lasso Regression (L1 Regularization)

Objective: \[L(\theta) = \|X\theta - y\|^2 + \lambda \|\theta\|_1 = \|X\theta - y\|^2 + \lambda \sum_{j=1}^d |\theta_j|\]

Key Properties:

  • No closed form \(\to\) solved via coordinate descent or proximal gradient

  • Bayesian view: MAP with Laplace prior \(p(\theta_j) \propto \exp(-\lambda|\theta_j|)\)

  • Sparsity: Sets many \(\theta_j\) to exactly zero \(\to\) automatic feature selection

  • Soft-thresholding operator: For coordinate update, \[\theta_j \leftarrow S_\lambda(z) = \text{sign}(z) \cdot \max(|z| - \lambda, 0)\] where \(z = \frac{1}{n} \sum_{i=1}^n x_{ij}(y_i - \sum_{k \neq j} x_{ik}\theta_k)\) (residual correlation)

Lasso vs Ridge:

  • Lasso: Sparse solutions (feature selection), works well with irrelevant features

  • Ridge: Dense solutions (all features retained), better when all features are relevant

  • Elastic Net: Combines both: \(\lambda_1 \|\theta\|_1 + \lambda_2 \|\theta\|^2\)

5.5 Huber Loss (Robust Regression)

Huber loss is robust to outliers, blending L2 (small errors) and L1 (large errors): \[\ell_\delta(r) = \begin{cases} \frac{1}{2} r^2, & |r| \le \delta \\ \delta(|r| - \frac{\delta}{2}), & |r| > \delta \end{cases}\]

Why: Quadratic for small residuals (efficient), linear for large residuals (robust to outliers).
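A minimal vectorized implementation (sketch; note the two branches meet continuously at \(|r| = \delta\)):

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond."""
    r = np.asarray(r, dtype=float)
    quad = 0.5 * r**2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quad, lin)
```

For example, `huber(0.5)` gives 0.125 (quadratic branch) and `huber(3.0)` gives 2.5 (linear branch), with both branches equal to 0.5 at the boundary \(r = \delta = 1\).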

Example

When to Use:

  • OLS: Clean data, want interpretability

  • Ridge: Multicollinearity, all features relevant

  • Lasso: High-dimensional data (\(d > n\)), want sparse model

  • Huber: Data with outliers (e.g., sensor noise)

6 Statistical Inference Essentials

6.1 Confidence Intervals

A confidence interval for parameter \(\theta\) at level \((1-\alpha)\) (e.g., 95% when \(\alpha=0.05\)) is an interval \([L, U]\) such that: \[P(L \le \theta \le U) = 1 - \alpha\]

Interpretation: If we repeat the experiment many times and construct a 95% CI each time, approximately 95% of those intervals will contain the true parameter \(\theta\). Not "95% probability \(\theta\) is in this specific interval" (that’s Bayesian credible intervals).

6.1.1 Confidence Intervals for OLS Coefficients

Under \(y = X\theta + \varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, \sigma^2 I)\): \[\hat{\theta}_j \sim \mathcal{N}\big(\theta_j, \sigma^2 [(X^TX)^{-1}]_{jj}\big)\]

Standard Error: \(\text{SE}(\hat{\theta}_j) = \hat{\sigma} \sqrt{[(X^TX)^{-1}]_{jj}}\) where \(\hat{\sigma}^2 = \frac{1}{n-d}\|y - X\hat{\theta}\|^2\) (unbiased estimator).

95% Confidence Interval: \[\hat{\theta}_j \pm t_{n-d, 0.975} \cdot \text{SE}(\hat{\theta}_j)\]

where \(t_{n-d, 0.975}\) is the 97.5th percentile of Student’s t-distribution with \(n-d\) degrees of freedom. For large \(n\), use \(z_{0.975} \approx 1.96\) (normal approximation).

Common Critical Values:

  • 90% CI: \(z_{0.95} = 1.645\)

  • 95% CI: \(z_{0.975} = 1.96\)

  • 99% CI: \(z_{0.995} = 2.576\)

6.2 Hypothesis Testing

Null Hypothesis \(H_0\): Default assumption (e.g., \(\theta_j = 0\) for "no effect")

Alternative \(H_1\): What we want to show (e.g., \(\theta_j \ne 0\))

Test Statistic: \[t = \frac{\hat{\theta}_j - 0}{\text{SE}(\hat{\theta}_j)} \sim t_{n-d} \quad \text{under } H_0\]

P-value: Probability of observing test statistic as extreme as \(t\) (or more extreme) under \(H_0\). \[p\text{-value} = P(|T| \ge |t| \mid H_0)\]

Decision Rule: Reject \(H_0\) if \(p\text{-value} < \alpha\) (significance level, typically 0.05).

Connection to CIs: If 95% CI for \(\theta_j\) excludes 0, then \(p < 0.05\) for testing \(H_0: \theta_j = 0\).

6.3 Asymptotic Confidence Intervals (General)

For any consistent estimator \(\hat{\theta}\) with asymptotic normality: \[\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}(0, V)\]

95% Asymptotic CI: \[\hat{\theta} \pm 1.96 \cdot \frac{\hat{\sigma}}{\sqrt{n}}\]

where \(\hat{\sigma}^2\) estimates the asymptotic variance \(V\).

Bootstrap CIs (Non-parametric):

  1. Resample data with replacement \(B\) times (e.g., \(B=1000\))

  2. Compute \(\hat{\theta}^{(b)}\) for each bootstrap sample

  3. Percentile method: 95% CI is \([\hat{\theta}_{2.5\%}^*, \hat{\theta}_{97.5\%}^*]\) (2.5th and 97.5th percentiles of bootstrap distribution)

Bootstrap works when parametric assumptions fail or standard error formulas are unknown.
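The percentile method in code (a sketch; the exponential sample and \(B\) are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)
data = rng.exponential(scale=2.0, size=500)   # true mean = 2.0

B = 2000
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(B)
])
# Percentile-method 95% CI for the mean
lo, hi = np.percentile(boot_means, [2.5, 97.5])
assert lo < data.mean() < hi
```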

6.4 Multiple Testing Correction

Testing \(m\) hypotheses simultaneously increases Type I error (false positives).

Bonferroni Correction: Use significance level \(\alpha/m\) for each test to maintain family-wise error rate \(\le \alpha\).

False Discovery Rate (FDR): Control expected proportion of false positives among rejections. Benjamini-Hochberg procedure is less conservative than Bonferroni.

Example

OLS Coefficient Interpretation

Fit \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon\) on \(n=100\) samples. Output:

Coef | Estimate | Std Error | p-value
\(\beta_1\) | 0.85 | 0.12 | \(<0.001\)
\(\beta_2\) | 0.04 | 0.08 | 0.62

95% CI for \(\beta_1\): \(0.85 \pm 1.96 \times 0.12 = [0.61, 1.09]\) (significant)

95% CI for \(\beta_2\): \(0.04 \pm 1.96 \times 0.08 = [-0.12, 0.20]\) (includes 0, not significant)

Interpretation: Strong evidence that \(x_1\) affects \(y\) (p \(<\) 0.001). Insufficient evidence for \(x_2\) effect (p = 0.62).

Note

Interview Questions:

Q: What does a 95% confidence interval mean?

A: If we repeat the experiment many times, 95% of the constructed intervals will contain the true parameter. It’s a statement about the procedure, not the specific interval.

Q: How do you interpret a p-value of 0.03?

A: If the null hypothesis is true, there’s a 3% chance of observing data this extreme (or more). With \(\alpha=0.05\), we reject \(H_0\) (statistically significant). But always consider effect size and practical significance.

Q: When do you need multiple testing correction?

A: When testing many hypotheses simultaneously (e.g., testing 1000 features in genomics). Without correction, you’ll get false positives by chance (\(5\%\) of 1000 = 50 false positives at \(\alpha=0.05\)).

7 Exponential Families & GLMs

7.1 Exponential Family

A distribution belongs to the exponential family if: \[p(y \mid \eta) = h(y) \exp\big(\eta^\top T(y) - A(\eta)\big)\]

where \(\eta\) is the natural parameter, \(T(y)\) is the sufficient statistic, \(A(\eta)\) is the log-partition function, and \(h(y)\) is the base measure.

Key Examples:

Distribution | Natural Param \(\eta\) | Sufficient Stat \(T(y)\)
Bernoulli\((p)\) | \(\log\frac{p}{1-p}\) | \(y\)
Gaussian\((\mu, \sigma^2)\) | \([\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}]\) | \([y, y^2]\)
Poisson\((\lambda)\) | \(\log \lambda\) | \(y\)
Categorical\((p_1,\ldots,p_K)\) | \([\log p_1, \ldots, \log p_{K-1}]\) | \([\mathbb{1}(y=1), \ldots, \mathbb{1}(y=K-1)]\)

Properties:

  • \(\mathbb{E}(T(Y)) = \nabla_\eta A(\eta)\) (mean is derivative of log-partition function)

  • \(\text{Var}(T(Y)) = \nabla^2_\eta A(\eta)\) (variance is second derivative)

  • Sufficient statistic: \(T(y)\) captures all information needed to estimate \(\eta\) from data. For \(n\) i.i.d. observations, only \(\sum_{i=1}^n T(y_i)\) is needed (data compression with no information loss)

  • Conjugate priors exist for all exponential families

7.2 Generalized Linear Models (GLMs)

GLMs extend linear regression to non-Gaussian outcomes via a link function \(g\):

Framework:

  1. Linear predictor: \(\eta = X\theta\)

  2. Link function: \(\eta = g(\mu)\) where \(\mu = \mathbb{E}(Y \mid X)\)

  3. Response distribution from exponential family: \(Y \sim p(y \mid \eta)\)

Common GLMs:

Model | Distribution | Link \(g(\mu)\) | Inverse \(g^{-1}(\eta)\)
Linear Regression | Gaussian | Identity: \(\mu\) | \(\eta\)
Logistic Regression | Bernoulli | Logit: \(\log\frac{\mu}{1-\mu}\) | \(\frac{1}{1+e^{-\eta}}\)
Poisson Regression | Poisson | Log: \(\log \mu\) | \(e^\eta\)

Logistic Regression Derivation:

For binary \(y \in \{0,1\}\), Bernoulli distribution: \(p(y=1) = \mu\). Using logit link: \[\log\frac{\mu}{1-\mu} = X\theta \quad \Rightarrow \quad \mu = \frac{1}{1 + e^{-X\theta}} = \sigma(X\theta)\]

Negative log-likelihood (cross-entropy loss): \[L(\theta) = -\sum_{i=1}^n \left[y_i \log \hat{y}_i + (1-y_i) \log(1-\hat{y}_i)\right]\]

where \(\hat{y}_i = \sigma(x_i^\top \theta)\). No closed form → optimized via gradient descent.
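A minimal gradient-descent fit of the cross-entropy loss (an illustrative sketch: `fit_logistic`, the learning rate, and the synthetic data are choices made here, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Gradient descent on the mean cross-entropy loss."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)            # predicted probabilities
        grad = X.T @ (p - y) / len(y)     # gradient of mean NLL
        theta -= lr * grad
    return theta

rng = np.random.default_rng(10)
X = rng.normal(size=(500, 2))
true_theta = np.array([2.0, -1.0])
y = (rng.uniform(size=500) < sigmoid(X @ true_theta)).astype(float)

theta_hat = fit_logistic(X, y)   # recovers the signs and rough magnitudes
```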

Note

Interview Question: Why use logit link for binary classification?

Answer: Logit maps probabilities \((0,1)\) to real line \((-\infty, \infty)\), matching range of linear predictor \(X\theta\). It’s the canonical link for Bernoulli (exponential family property).

8 Probability Distributions & Conjugacy

8.1 Key Distributions

Discrete Distributions:

  • Bernoulli\((p)\): Single binary outcome. \(P(X=1) = p\)

    Physical process: Coin flip, binary sensor reading, yes/no decision

  • Binomial\((n,p)\): Number of successes in \(n\) independent Bernoulli trials

    \(P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}\)

    Physical process: Number of heads in \(n\) coin flips, number of defective items in batch

  • Geometric\((p)\): Number of trials until first success

    \(P(X=k) = (1-p)^{k-1} p\)

    Physical process: Number of independent Bernoulli trials until first success. Models memoryless discrete waiting times (e.g., radioactive decay, number of coin flips until first heads).

  • Poisson\((\lambda)\): Count of rare events in fixed interval

    \(P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}\)

    Physical process: Radioactive decay counts, number of emails per hour, rare disease cases. Arises as limit of Binomial\((n,p)\) when \(n \to \infty\), \(p \to 0\), \(np = \lambda\) fixed.

  • Categorical\((p_1,\ldots,p_K)\): One of \(K\) discrete outcomes (generalized Bernoulli)

    \(P(X=k) = p_k\), where \(\sum_k p_k = 1\)

    Physical process: Dice roll, word choice in language model, pixel class in segmentation

Continuous Distributions:

  • Gaussian/Normal\((\mu, \sigma^2)\): \(p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\)

    Physical process: Measurement errors, sum of many independent random variables (Central Limit Theorem). Natural consequence of maximum entropy with fixed mean and variance.

    Why ubiquitous: CLT guarantees that averages/sums converge to Gaussian regardless of underlying distribution.

  • Exponential\((\lambda)\): Time until next event in Poisson process

    \(p(x) = \lambda e^{-\lambda x}\), \(x \ge 0\)

    Physical process: Time between radioactive decays, customer inter-arrival times, component lifetime (memoryless)

  • Gamma\((\alpha, \beta)\): Sum of \(\alpha\) exponential waiting times

    \(p(x) \propto x^{\alpha-1} e^{-\beta x}\)

    Physical process: Time until \(k\)-th event in Poisson process. Used as prior for rates/scales (always positive).

  • Beta\((\alpha, \beta)\): Distribution over probabilities

    \(p(x) \propto x^{\alpha-1} (1-x)^{\beta-1}\), \(x \in [0,1]\)

    Use case: Bayesian prior for unknown probability parameter (e.g., coin bias). Flexible: it can be uniform, U-shaped, or peaked.

  • Dirichlet\((\alpha_1,\ldots,\alpha_K)\): Distribution over probability vectors

    \(p(x_1,\ldots,x_K) \propto \prod_{i=1}^K x_i^{\alpha_i-1}\) where \(x_i \ge 0\), \(\sum_{i=1}^K x_i = 1\) (probability simplex)

    Simplex: The \((K-1)\)-dimensional space where all \(x_i \ge 0\) and sum to 1. For \(K=3\): a triangle in 3D space.

    Use case: Bayesian prior for categorical probabilities (generalized Beta to \(K\) dimensions). Used in topic modeling, Bayesian mixture models.

  • Student-t\((\nu)\): Heavy-tailed alternative to Gaussian

    \(p(x) \propto \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}\) where \(\nu\) is degrees of freedom

    Mathematical origin: Ratio \(\frac{Z}{\sqrt{V/\nu}}\) where \(Z \sim \mathcal{N}(0,1)\), \(V \sim \chi^2_\nu\). Arises when estimating mean with unknown variance from small samples.

    Physical process: Robust regression with outliers, small-sample inference (t-tests), financial returns (heavy tails). As \(\nu \to \infty\), converges to Gaussian; small \(\nu\) gives heavy tails (more outliers).

    Why useful: More robust to outliers than Gaussian due to heavier tails. Common choice: \(\nu=3\) or \(\nu=4\) for robust modeling.

Note

Central Limit Theorem (CLT):

Why is Gaussian everywhere? If \(X_1, \ldots, X_n\) are i.i.d. with mean \(\mu\) and finite variance \(\sigma^2\): \[\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0,1)\]

Implication: Sum/average of many independent effects \(\to\) Gaussian, regardless of individual distributions. This explains measurement errors, test scores, heights, etc.

8.2 Conjugate Priors

Conjugate prior: If prior \(p(\theta)\) and likelihood \(p(x|\theta)\) are conjugate, posterior \(p(\theta|x) \propto p(x|\theta)p(\theta)\) has the same functional form as the prior.

Why Conjugacy Matters:

  • Closed-form updates: No need for numerical integration or MCMC

  • Interpretable hyperparameters: Prior parameters have clear meaning (pseudo-counts, effective sample size)

  • Sequential updates: Process data in batches; the posterior becomes the new prior

  • Computational efficiency: Update in \(O(1)\) time per observation

Key Conjugate Pairs:

Likelihood | Conjugate Prior | Posterior Update
Bernoulli\((p)\) | Beta\((\alpha, \beta)\) | Beta\((\alpha + k, \beta + n - k)\)
Categorical\((p)\) | Dirichlet\((\alpha)\) | Dirichlet\((\alpha + \text{counts})\)
Poisson\((\lambda)\) | Gamma\((\alpha, \beta)\) | Gamma\((\alpha + \sum x_i, \beta + n)\)
Gaussian\((\mu, \sigma^2)\), known \(\sigma^2\) | Gaussian\((\mu_0, \sigma_0^2)\) | Gaussian (precision-weighted avg)
Example

Beta-Bernoulli: Bayesian Coin Flipping

Setup: Unknown coin bias \(p\). Prior belief: Beta\((\alpha, \beta)\).

Observe data: \(k\) heads in \(n\) flips.

Posterior: \(p \mid \text{data} \sim \text{Beta}(\alpha + k, \beta + n - k)\)

Interpretation:

  • \(\alpha\): pseudo-count of prior "heads"

  • \(\beta\): pseudo-count of prior "tails"

  • Total prior strength: \(\alpha + \beta\) (equivalent sample size)

  • Posterior mean: \(\frac{\alpha + k}{\alpha + \beta + n}\) (weighted average of prior and data)

Special Cases:

  • Beta\((1,1)\): Uniform prior (no preference)

  • Beta\((0.5, 0.5)\): Jeffreys prior (scale-invariant)

  • Beta\((100, 100)\): Strong prior belief that \(p \approx 0.5\)

Concrete Example:

Prior: Beta\((2, 2)\) (weak belief \(p \approx 0.5\))

Observe: 7 heads in 10 flips

Posterior: Beta\((2+7, 2+3) =\) Beta\((9, 5)\)

Posterior mean: \(\frac{9}{14} \approx 0.64\) (vs MLE \(= 0.7\))

The Bayesian estimate is "shrunk" toward prior mean \(0.5\).
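The concrete example above as code (a trivial sketch; the helper names are made up here):

```python
def beta_bernoulli_update(alpha, beta, heads, flips):
    """Conjugate posterior update: Beta prior + Bernoulli likelihood."""
    return alpha + heads, beta + (flips - heads)

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

# Prior Beta(2, 2); observe 7 heads in 10 flips
a, b = beta_bernoulli_update(2, 2, heads=7, flips=10)   # -> Beta(9, 5)
posterior_mean = beta_mean(a, b)                        # 9/14, shrunk toward 0.5
```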

Example

Gamma-Poisson: Event Rate Estimation

Setup: Unknown event rate \(\lambda\) (e.g., emails per hour). Prior: Gamma\((\alpha, \beta)\).

Observe: Total count \(\sum_{i=1}^n x_i\) over \(n\) time intervals.

Posterior: \(\lambda \mid \text{data} \sim \text{Gamma}\left(\alpha + \sum x_i, \beta + n\right)\)

Interpretation:

  • \(\alpha\): pseudo-count of prior events

  • \(\beta\): pseudo-count of prior time intervals

  • Prior mean: \(\frac{\alpha}{\beta}\)

  • Posterior mean: \(\frac{\alpha + \sum x_i}{\beta + n}\) (data dominates as \(n \to \infty\))

Note

Interview Insight: Conjugate priors are mathematically convenient but not always realistic. Modern Bayesian inference uses MCMC/variational methods for non-conjugate models. However, conjugacy remains fundamental for:

  • Hierarchical models (conjugate at each level)

  • Online learning (sequential Bayesian updates)

  • Theoretical analysis (derive closed-form posteriors)

8.3 Distribution Relationships

Key Transformations:

  • \(Z \sim \mathcal{N}(0,1) \Rightarrow Z^2 \sim \chi^2_1\) (chi-squared with 1 d.f.)

  • \(\sum_{i=1}^k Z_i^2 \sim \chi^2_k\) where \(Z_i \sim \mathcal{N}(0,1)\) i.i.d.

  • \(\frac{Z}{\sqrt{V/\nu}} \sim t_\nu\) (Student’s t) where \(Z \sim \mathcal{N}(0,1)\), \(V \sim \chi^2_\nu\)

  • \(X \sim \mathcal{N}(\mu, \sigma^2) \Rightarrow e^X \sim \text{Log-normal}(\mu, \sigma^2)\)

  • \(X \sim \text{Exponential}(\lambda) \Rightarrow \lfloor X \rfloor \sim \text{Geometric}\) (discretize continuous waiting time)

  • Sum of \(n\) Bernoulli\((p)\) trials \(\Rightarrow\) Binomial\((n,p)\)

  • Sum of \(n\) Exponential\((\lambda)\) r.v.s \(\Rightarrow\) Gamma\((n, \lambda)\) (Erlang distribution)

  • Limit relationships (CLT):

    • Binomial\((n,p) \approx \mathcal{N}(np, np(1-p))\) for large \(n\)

    • Poisson\((\lambda) \approx \mathcal{N}(\lambda, \lambda)\) for large \(\lambda\)

    • Binomial\((n,p) \approx\) Poisson\((np)\) when \(n\) large, \(p\) small, \(np\) moderate

Probability distribution relationships: transformations (black solid), conjugate priors (blue), limit/CLT (orange dashed). Green boxes = continuous priors, yellow = discrete. Each box includes the distribution’s key formula. Adapted from Leemis & McQueston (2008).
Note

Interview Tip: This diagram summarizes the famous Leemis & McQueston chart. Key points:

  • Transformations: \(Z^2 \to \chi^2\), \(e^X \to\) Log-normal, floor(Exp) \(\to\) Geometric

  • Conjugate pairs: Beta-Bernoulli, Gamma-Poisson, Dirichlet-Categorical

  • CLT limits: Binomial/Poisson \(\to\) Gaussian for large \(n\) or \(\lambda\)

  • Sums: Bernoullis \(\to\) Binomial, Exponentials \(\to\) Gamma (Erlang)

Frequently asked: "How are chi-squared and Gaussian related?" → \(\chi^2_k\) is sum of \(k\) squared standard normals.

9 Sampling Methods

9.1 Monte Carlo Basics

Monte Carlo estimation: Approximate expectations by sampling: \[\mathbb{E}_{p(x)}[f(x)] \approx \frac{1}{N} \sum_{i=1}^N f(x_i), \quad x_i \sim p(x)\]

Law of Large Numbers: \(\frac{1}{N} \sum_{i=1}^N f(x_i) \xrightarrow{a.s.} \mathbb{E}[f(x)]\) as \(N \to \infty\)

Estimator variance: \(\text{Var}\left[\frac{1}{N}\sum f(x_i)\right] = \frac{\text{Var}[f(x)]}{N}\) → standard error \(\propto 1/\sqrt{N}\), independent of dimension

TipExample

Monte Carlo Integration

Estimate \(\int_0^1 x^2 dx\) (true value = \(1/3\)):

Sample \(x_i \sim \text{Uniform}(0,1)\), compute \(\frac{1}{N}\sum x_i^2\)

With \(N=10^4\) samples: \(\approx 0.333 \pm 0.003\) (standard error, since \(\text{Var}[x^2] = \frac{1}{5} - \frac{1}{9} = \frac{4}{45}\) for \(x \sim \text{Uniform}(0,1)\))
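A minimal stdlib sketch of this estimate (the seed is arbitrary):

```python
import random

random.seed(0)
N = 10_000

# Estimate the integral of x^2 over [0, 1] by averaging f(x) = x^2
# over uniform samples; the true value is 1/3
fs = [random.random() ** 2 for _ in range(N)]
estimate = sum(fs) / N

# Standard error of the estimator: sqrt(Var[f(x)] / N)
var_f = sum(f * f for f in fs) / N - estimate ** 2
std_err = (var_f / N) ** 0.5

print(round(estimate, 3), round(std_err, 3))   # estimate near 0.333, SE near 0.003
```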

9.2 Importance Sampling

Problem: Can’t sample directly from \(p(x)\), but can evaluate \(\tilde{p}(x) = Z \cdot p(x)\) (unnormalized)

Solution: Sample from proposal \(q(x)\) and reweight: \[\mathbb{E}_{p}[f(x)] = \int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx \approx \frac{1}{N} \sum_{i=1}^N w_i f(x_i)\] where \(x_i \sim q(x)\) and importance weights \(w_i = \frac{p(x_i)}{q(x_i)}\) (or normalized: \(w_i = \frac{\tilde{p}(x_i)/q(x_i)}{\sum_j \tilde{p}(x_j)/q(x_j)}\))

Normalized weights (derivation): If \(\tilde{p}(x) = Z p(x)\), then \[\mathbb{E}_{p}[f(x)] = \frac{1}{Z} \int f(x) \tilde{p}(x) dx = \frac{\int f(x)\frac{\tilde{p}(x)}{q(x)} q(x) dx}{\int \frac{\tilde{p}(x)}{q(x)} q(x) dx} \approx \frac{\sum_i f(x_i)\tilde{p}(x_i)/q(x_i)}{\sum_i \tilde{p}(x_i)/q(x_i)}\] which yields normalized weights \(w_i = \frac{\tilde{p}(x_i)/q(x_i)}{\sum_j \tilde{p}(x_j)/q(x_j)}\).

Key insight: Good proposal \(q(x)\) should have heavy tails covering \(p(x)\) support.

Effective Sample Size (ESS): \(\text{ESS} = \frac{1}{\sum_i \tilde{w}_i^2}\) where \(\tilde{w}_i\) are normalized weights. Low ESS → poor proposal.
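A self-normalized importance-sampling sketch with ESS. The target is a standard normal known only up to normalization, the proposal a wider Gaussian (so its tails cover the target); all specific choices are illustrative:

```python
import math, random

random.seed(42)

def p_tilde(x):               # unnormalized target: standard normal up to Z
    return math.exp(-0.5 * x * x)

def q_pdf(x, s=2.0):          # proposal N(0, s^2): heavier tails than the target
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2 * math.pi))

N = 20_000
xs = [random.gauss(0, 2.0) for _ in range(N)]
w = [p_tilde(x) / q_pdf(x) for x in xs]   # unnormalized importance weights
W = sum(w)
w_norm = [wi / W for wi in w]             # self-normalized weights

est = sum(wi * x * x for wi, x in zip(w_norm, xs))   # E_p[x^2], true value 1
ess = 1.0 / sum(wi ** 2 for wi in w_norm)

print(round(est, 2))        # near 1.0
print(round(ess / N, 2))    # ESS fraction < 1 reflects target/proposal mismatch
```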

9.3 Markov Chain Monte Carlo (MCMC)

Goal: Sample from complex posterior \(p(\theta \mid x)\) when direct sampling is intractable.

9.3.1 The Fixed-Point View of MCMC

Core Idea: MCMC constructs a Markov transition operator whose fixed point (stationary distribution) is the target distribution we want to sample from.

You don’t sample from the distribution directly–you design dynamics whose long-run behavior equals that distribution.

Operator Perspective:

Let:

  • \(\pi(x)\) = target distribution (known up to normalization)

  • \(K(x \to x')\) = Markov transition kernel

Define the distribution update operator: \[(T\mu)(x') = \int \mu(x) K(x \to x') \, dx\]

This means: "If the current distribution is \(\mu\), after one Markov step it becomes \(T\mu\)."

The MCMC Goal: Construct \(K\) such that: \[T\pi = \pi \quad \text{($\pi$ is a fixed point of $T$)}\]

Why Fixed Points Are Sufficient:

If the chain is ergodic (can reach everywhere), aperiodic, and stable, then repeated application converges: \[\mu_0 \xrightarrow{T} \mu_1 \xrightarrow{T} \mu_2 \xrightarrow{T} \cdots \to \pi\]

\(\Rightarrow\) Even if you start with the wrong distribution, the dynamics wash out initialization.

Note

Key Insight: You don’t need to know how to draw from \(\pi\); you just need to know how to leave it invariant.

9.3.2 Detailed Balance: Enforcing the Fixed Point

Detailed Balance Condition (sufficient but not necessary): \[\pi(x) K(x \to x') = \pi(x') K(x' \to x)\]

Interpretation: Probability mass flowing from \(x \to x'\) equals mass flowing from \(x' \to x\).

\(\Rightarrow\) No net drift once you’re at \(\pi\).

This guarantees the fixed point: \[\int \pi(x) K(x \to x') \, dx = \pi(x')\]

Physical Intuition: Think of MCMC as diffusion in an energy landscape with \(\pi(x) \propto e^{-E(x)}\):

  • Random motion (exploration)

  • Rejection discourages climbing energy too often

  • Chain wanders freely in low-energy regions

  • On average, spends time proportional to \(e^{-E(x)}\)

The stationary distribution is the equilibrium state of the dynamics.

9.3.3 Metropolis-Hastings Algorithm

MH as Fixed-Point Engineering:

Split the transition into two parts:

  1. Proposal kernel \(q(x'|x)\) \(\to\) suggests a move (can be arbitrary)

  2. Acceptance correction \(\to\) enforces invariance

Acceptance Probability: \[\alpha(x,x') = \min\left(1, \frac{\pi(x') q(x|x')}{\pi(x) q(x'|x)}\right)\]

Why This Works:

  • Proposals can be local, global, or even bad

  • Acceptance rebalances probability flow

  • The resulting effective kernel satisfies detailed balance

\(\Rightarrow\) MH: "I don’t know how to sample from \(\pi\), but I can correct a bad proposal so that \(\pi\) becomes a fixed point."

Algorithm:

  1. Initialize \(\theta^{(0)}\)

  2. For \(t = 1, 2, \ldots\):

    • Propose: \(\theta' \sim q(\cdot \mid \theta^{(t-1)})\)

    • Compute acceptance ratio: \(\alpha = \min\left(1, \frac{p(\theta') q(\theta^{(t-1)} \mid \theta')}{p(\theta^{(t-1)}) q(\theta' \mid \theta^{(t-1)})}\right)\)

    • Accept with probability \(\alpha\): \(\theta^{(t)} = \begin{cases} \theta' & \text{with prob } \alpha \\ \theta^{(t-1)} & \text{otherwise} \end{cases}\)

Symmetric proposal: If \(q(\theta' \mid \theta) = q(\theta \mid \theta')\) (e.g., random walk), then \(\alpha = \min\left(1, \frac{p(\theta')}{p(\theta^{(t-1)})}\right)\)

Note: Symmetric means the proposal kernel is the same in both directions: \(q(x'|x) = q(x|x')\) (e.g., Gaussian centered at current state).

TipExample

Numerical Example: Sampling from Beta(2,5)

Target: \(p(\theta) \propto \theta^{2-1}(1-\theta)^{5-1} = \theta(1-\theta)^4\) on \([0,1]\)

Proposal: Symmetric random walk: \(\theta' = \theta + \mathcal{N}(0, 0.1^2)\), reflected at boundaries

Iteration 1:

  • Current: \(\theta^{(0)} = 0.5\), \(p(\theta^{(0)}) = 0.5 \times 0.5^4 = 0.03125\)

  • Propose: \(\theta' = 0.5 + 0.08 = 0.58\)

  • Evaluate: \(p(\theta') = 0.58 \times 0.42^4 = 0.01806\)

  • Acceptance ratio: \(\alpha = \min(1, 0.01806/0.03125) = 0.578\)

  • Draw \(u \sim \text{Uniform}(0,1)\): \(u = 0.32 < 0.578\)Accept

  • Result: \(\theta^{(1)} = 0.58\)

Iteration 2:

  • Current: \(\theta^{(1)} = 0.58\)

  • Propose: \(\theta' = 0.58 - 0.12 = 0.46\)

  • Evaluate: \(p(\theta') = 0.46 \times 0.54^4 = 0.03911\)

  • Acceptance ratio: \(\alpha = \min(1, 0.03911/0.01806) = 1.0\) → always accept (uphill)

  • Result: \(\theta^{(2)} = 0.46\)

Key Insight: Moves toward higher probability always accepted; moves downhill accepted probabilistically. Chain explores \([0,1]\) proportionally to \(p(\theta)\).
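The iterations above, run to convergence: a minimal random-walk MH sketch for the unnormalized Beta(2,5) target (seed, proposal scale, and chain length are illustrative):

```python
import random

random.seed(0)

def p_unnorm(theta):                  # Beta(2,5) target up to normalization
    return theta * (1 - theta) ** 4 if 0 < theta < 1 else 0.0

def reflect(theta):                   # reflect proposals back into (0, 1)
    theta = abs(theta)
    return 2 - theta if theta > 1 else theta

theta, chain = 0.5, []
for _ in range(50_000):
    prop = reflect(theta + random.gauss(0, 0.1))     # symmetric proposal
    alpha = min(1.0, p_unnorm(prop) / p_unnorm(theta))
    if random.random() < alpha:       # accept uphill always, downhill sometimes
        theta = prop
    chain.append(theta)

burned = chain[5_000:]                # discard burn-in
mean = sum(burned) / len(burned)
print(round(mean, 2))                 # Beta(2,5) mean is 2/7, about 0.29
```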

Why Normalization Constants Don’t Matter:

Suppose \(\pi(x) = \frac{\tilde{\pi}(x)}{Z}\) where \(Z\) is unknown. In MH: \[\frac{\pi(x')}{\pi(x)} = \frac{\tilde{\pi}(x')}{\tilde{\pi}(x)}\]

The partition function \(Z\) cancels. \(\Rightarrow\) MCMC doesn’t compute distributions–it equalizes flows.

9.3.4 Gibbs Sampling

Special case: When conditional distributions \(p(\theta_i \mid \theta_{-i}, x)\) are easy to sample from.

Algorithm: Cycle through dimensions, sampling each conditional:

  1. Initialize \(\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_d^{(0)})\)

  2. For \(t = 1, 2, \ldots\):

    • \(\theta_1^{(t)} \sim p(\theta_1 \mid \theta_2^{(t-1)}, \ldots, \theta_d^{(t-1)}, x)\)

    • \(\theta_2^{(t)} \sim p(\theta_2 \mid \theta_1^{(t)}, \theta_3^{(t-1)}, \ldots, \theta_d^{(t-1)}, x)\)

    • \(\vdots\)

    • \(\theta_d^{(t)} \sim p(\theta_d \mid \theta_1^{(t)}, \ldots, \theta_{d-1}^{(t)}, x)\)

Advantage: No tuning of proposal distribution (always accepts)

Disadvantage: Can have slow mixing if variables are highly correlated

9.3.5 The Control-Theory Analogy

Think of MCMC design as a control system:

  • \(\pi\) = equilibrium point

  • \(T\) = system dynamics

  • Detailed balance = zero steady-state error

  • Mixing rate = convergence speed

Bad MCMC: Has the right fixed point but terrible convergence (high autocorrelation)

Good MCMC: Same fixed point, faster contraction toward it

This explains why HMC, Langevin dynamics, and NUTS are all about improving the dynamics, not changing the target.

Note

One-Line Mental Model:

MCMC doesn’t sample distributions–it designs a Markov operator whose invariant measure is the distribution. Sampling is just running the dynamics long enough.

TipExample

Gibbs for Bayesian Linear Regression

Model: \(y \mid X, \beta, \sigma^2 \sim \mathcal{N}(X\beta, \sigma^2 I)\)

Priors: \(\beta \sim \mathcal{N}(0, \tau^2 I)\), \(\sigma^2 \sim \text{Inv-Gamma}(a, b)\)

Conditionals:

  • \(p(\beta \mid \sigma^2, y, X) = \mathcal{N}(\hat{\beta}, V)\) where \(V = (\frac{1}{\sigma^2}X^TX + \frac{1}{\tau^2}I)^{-1}\), \(\hat{\beta} = V \frac{1}{\sigma^2}X^Ty\)

  • \(p(\sigma^2 \mid \beta, y, X) = \text{Inv-Gamma}\left(a + \frac{n}{2}, b + \frac{1}{2}\|y - X\beta\|^2\right)\)

Alternate sampling these conditionals to get posterior samples of \((\beta, \sigma^2)\).
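A scalar version of this Gibbs sampler (one coefficient, synthetic data, stdlib only; priors, seed, and chain length are illustrative):

```python
import random, math

random.seed(1)

# Synthetic 1-D data: y = beta_true * x + noise (illustrative)
n, beta_true = 200, 1.5
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [beta_true * x + random.gauss(0, 0.5) for x in xs]

tau2, a, b = 10.0, 2.0, 1.0                 # prior hyperparameters
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

def sample_inv_gamma(shape, scale):
    # Inv-Gamma(shape, scale) via scale / Gamma(shape, 1)
    return scale / random.gammavariate(shape, 1.0)

beta, sigma2, draws = 0.0, 1.0, []
for t in range(3000):
    # beta | sigma2 ~ N(beta_hat, v) with v = (sxx/sigma2 + 1/tau2)^(-1)
    v = 1.0 / (sxx / sigma2 + 1.0 / tau2)
    beta = random.gauss(v * sxy / sigma2, math.sqrt(v))
    # sigma2 | beta ~ Inv-Gamma(a + n/2, b + RSS/2)
    rss = sum((y - beta * x) ** 2 for x, y in zip(xs, ys))
    sigma2 = sample_inv_gamma(a + n / 2, b + rss / 2)
    if t >= 500:                            # burn-in
        draws.append(beta)

post_mean = sum(draws) / len(draws)
print(round(post_mean, 1))                  # near beta_true = 1.5
```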

9.4 MCMC Diagnostics

Burn-in: Discard initial samples before chain reaches stationarity (typically first 10-50%)

Thinning: Keep every \(k\)-th sample to reduce autocorrelation (though not strictly necessary)

Convergence Diagnostics:

  • Trace plots: Visual inspection of \(\theta^{(t)}\) vs \(t\). Should look like "fuzzy caterpillar" with no trends.

  • Gelman-Rubin \(\hat{R}\): Run multiple chains, compare within-chain vs between-chain variance. Want \(\hat{R} < 1.1\).

  • Effective Sample Size (ESS): Accounts for autocorrelation. \(\text{ESS} \approx \frac{N}{1 + 2\sum_{k=1}^\infty \rho_k}\) where \(\rho_k\) is lag-\(k\) autocorrelation.

Mixing: How quickly chain explores the distribution. Poor mixing → high autocorrelation, low ESS.
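A rough sketch of the ESS estimator on a synthetic AR(1) chain (the truncation rule is a crude illustration; production tools such as ArviZ use more careful estimators):

```python
import random

random.seed(0)

def ess(chain, max_lag=200):
    # ESS = N / (1 + 2 * sum of lag-k autocorrelations), truncated when rho is small
    n = len(chain)
    mu = sum(chain) / n
    var = sum((x - mu) ** 2 for x in chain) / n
    total = 0.0
    for k in range(1, max_lag):
        rho = sum((chain[i] - mu) * (chain[i + k] - mu)
                  for i in range(n - k)) / (n * var)
        if rho < 0.05:          # crude truncation rule
            break
        total += rho
    return n / (1 + 2 * total)

# AR(1) chain with phi = 0.9; theoretical ESS/N = (1 - phi)/(1 + phi), about 0.053
phi, x, chain = 0.9, 0.0, []
for _ in range(20_000):
    x = phi * x + random.gauss(0, 1)
    chain.append(x)

frac = ess(chain) / len(chain)
print(round(frac, 2))           # roughly 0.05: high autocorrelation, low ESS
```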

Note

Interview Insight: Always mention diagnostics when discussing MCMC! Common mistake: "just run chain for 10,000 iterations" without checking convergence.

9.5 Hamiltonian Monte Carlo (HMC)

Motivation: Random walk MCMC has slow mixing in high dimensions. HMC uses gradient information to propose distant points with high acceptance.

Idea: Introduce auxiliary momentum variable \(p\) and simulate Hamiltonian dynamics: \[H(\theta, p) = -\log p(\theta \mid x) + \frac{1}{2}p^T M^{-1} p\]

Hamilton’s equations: \[\frac{d\theta}{dt} = M^{-1}p, \quad \frac{dp}{dt} = -\nabla_\theta (-\log p(\theta \mid x))\]

Algorithm (Leapfrog integration):

  1. Sample momentum: \(p^{(0)} \sim \mathcal{N}(0, M)\)

  2. Simulate dynamics for \(L\) steps with step size \(\epsilon\):

    • Half-step for \(p\): \(p^{(1/2)} = p^{(0)} + \frac{\epsilon}{2} \nabla \log p(\theta^{(0)})\)

    • Full-step for \(\theta\): \(\theta^{(1)} = \theta^{(0)} + \epsilon M^{-1} p^{(1/2)}\)

    • Iterate...

  3. Accept/reject with Metropolis step (corrects discretization error)

Advantages:

  • Proposes distant points (low autocorrelation)

  • High acceptance rate (\(\sim 0.65\) optimal)

  • Scales well to high dimensions

No-U-Turn Sampler (NUTS): Adaptive HMC that automatically tunes \(L\) and \(\epsilon\). Used in Stan/PyMC.
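A one-dimensional HMC sketch for a standard-normal target with \(M = I\) (step size, path length, start point, and seed are all illustrative):

```python
import math, random

random.seed(0)

def log_p(x):                 # target: standard normal, up to a constant
    return -0.5 * x * x

def grad_log_p(x):
    return -x

def hmc_step(x, eps=0.2, steps=20):
    p0 = random.gauss(0, 1)                    # resample momentum (M = I)
    x_new, p = x, p0
    p += 0.5 * eps * grad_log_p(x_new)         # initial half-step for momentum
    for i in range(steps):
        x_new += eps * p                       # full position step
        if i < steps - 1:
            p += eps * grad_log_p(x_new)       # full momentum step
    p += 0.5 * eps * grad_log_p(x_new)         # final half-step
    # Metropolis correction for leapfrog discretization error
    h_old = -log_p(x) + 0.5 * p0 * p0
    h_new = -log_p(x_new) + 0.5 * p * p
    log_acc = h_old - h_new
    if log_acc >= 0 or random.random() < math.exp(log_acc):
        return x_new
    return x

x, chain = 3.0, []                             # deliberately bad start
for _ in range(5_000):
    x = hmc_step(x)
    chain.append(x)

mean = sum(chain) / len(chain)
var = sum((c - mean) ** 2 for c in chain) / len(chain)
print(round(mean, 1), round(var, 1))           # near 0 and 1
```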

9.6 Variational Inference (VI)

Alternative to sampling: Approximate \(p(\theta \mid x)\) with simpler distribution \(q(\theta)\) from tractable family.

Objective: Minimize KL divergence \(\text{KL}(q \| p)\): \[q^* = \operatorname*{arg\,min}_{q \in \mathcal{Q}} \text{KL}(q(\theta) \| p(\theta \mid x))\]

Equivalently, maximize Evidence Lower Bound (ELBO): \[\mathcal{L}(q) = \mathbb{E}_q[\log p(x, \theta)] - \mathbb{E}_q[\log q(\theta)] = \log p(x) - \text{KL}(q \| p)\]

Mean-field approximation: Assume \(q(\theta) = \prod_{i=1}^d q_i(\theta_i)\) (factorized)

Coordinate ascent VI (CAVI): Update each \(q_i\) holding others fixed: \[q_i^*(\theta_i) \propto \exp\left(\mathbb{E}_{q_{-i}}[\log p(x, \theta)]\right)\]

Note

VI vs MCMC Trade-offs

MCMC (e.g., HMC):

  • \(\checkmark\) Asymptotically exact (given convergence)

  • \(\checkmark\) Quantifies uncertainty well

  • \(\times\) Slow (requires many samples)

  • \(\times\) Difficult to diagnose convergence

Variational Inference:

  • \(\checkmark\) Fast (optimization, not sampling)

  • \(\checkmark\) Scales to large datasets (stochastic VI)

  • \(\times\) Approximate (biased by choice of \(q\))

  • \(\times\) Underestimates uncertainty (mode-seeking KL)

Note

Interview Question: "When would you use MCMC vs variational inference?"

Answer: MCMC when accuracy is critical and computational budget allows (e.g., scientific modeling). VI when speed matters and approximate posterior is acceptable (e.g., large-scale Bayesian deep learning, recommendation systems).

10 Latent Variable Models & EM

10.1 Mixture Models

Mixture of Gaussians (GMM): Model complex distributions as weighted sum of Gaussians: \[p(x) = \sum_{k=1}^K \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)\]

where \(\pi_k\) are mixture weights (\(\sum_k \pi_k = 1\)), \(\mu_k\) are means, \(\Sigma_k\) are covariances.

Latent variable interpretation: Introduce hidden cluster assignment \(z \in \{1,\ldots,K\}\): \[\begin{align*} p(z=k) & = \pi_k \\ p(x \mid z=k) & = \mathcal{N}(x \mid \mu_k, \Sigma_k) \\ p(x) & = \sum_k p(z=k) p(x \mid z=k) \end{align*}\]

10.2 Expectation-Maximization (EM) Algorithm

Problem: Maximize incomplete data log-likelihood \(\log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)\) (sum inside log → hard).

Idea: Lower-bound via Jensen’s inequality. For any distribution \(q(z)\): \[\log p(x \mid \theta) = \log \sum_z p(x,z \mid \theta) = \log \sum_z q(z) \frac{p(x,z \mid \theta)}{q(z)} \ge \sum_z q(z) \log \frac{p(x,z \mid \theta)}{q(z)} = \mathcal{L}(q, \theta)\]

This is the Evidence Lower Bound (ELBO).

EM Iterations:

  1. E-step: Fix \(\theta\), maximize \(\mathcal{L}\) w.r.t. \(q\): \[q^*(z) = p(z \mid x, \theta^{\text{old}}) \quad \text{(posterior)}\]

  2. M-step: Fix \(q\), maximize \(\mathcal{L}\) w.r.t. \(\theta\): \[\theta^{\text{new}} = \operatorname*{arg\,max}_\theta \sum_z q(z) \log p(x, z \mid \theta)\]

Guarantees:

  • \(\log p(x \mid \theta^{t+1}) \ge \log p(x \mid \theta^t)\) (monotonic improvement)

  • Converges to local optimum (not guaranteed global)

GMM E-step: Compute responsibilities (soft cluster assignments): \[\gamma_{ik} = p(z_i = k \mid x_i, \theta) = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}\]

GMM M-step: Update parameters using weighted MLE: \[\begin{align*} \pi_k & = \frac{1}{n} \sum_{i=1}^n \gamma_{ik} \\ \mu_k & = \frac{\sum_{i=1}^n \gamma_{ik} x_i}{\sum_{i=1}^n \gamma_{ik}} \\ \Sigma_k & = \frac{\sum_{i=1}^n \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top}{\sum_{i=1}^n \gamma_{ik}} \end{align*}\]
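These E/M updates for a two-component 1-D mixture on synthetic data (data, initialization, and iteration count are illustrative):

```python
import math, random

random.seed(0)

# Synthetic 1-D data from two Gaussians (illustrative)
data = [random.gauss(-2, 0.7) for _ in range(150)] + \
       [random.gauss(3, 1.0) for _ in range(150)]

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

pi, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]   # init for K = 2

for _ in range(50):
    # E-step: responsibilities gamma[i][k]
    gamma = []
    for x in data:
        num = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
        z = sum(num)
        gamma.append([nk / z for nk in num])
    # M-step: weighted MLE updates
    for k in range(2):
        nk = sum(g[k] for g in gamma)
        pi[k] = nk / len(data)
        mu[k] = sum(g[k] * x for g, x in zip(gamma, data)) / nk
        var[k] = sum(g[k] * (x - mu[k]) ** 2 for g, x in zip(gamma, data)) / nk

print([round(m, 1) for m in sorted(mu)])   # component means near [-2, 3]
```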

TipExample

EM Applications:

  • GMM clustering (soft k-means)

  • Hidden Markov Models (Baum-Welch algorithm)

  • Topic modeling (Latent Dirichlet Allocation)

  • Missing data imputation

11 Clustering & Dimensionality Reduction

11.1 k-Means Clustering

Objective: Partition \(n\) points into \(K\) clusters minimizing within-cluster variance: \[\min_{\mu_1,\ldots,\mu_K} \sum_{i=1}^n \min_{k \in \{1,\ldots,K\}} \|x_i - \mu_k\|^2\]

Lloyd’s Algorithm:

  1. Assignment: Assign each \(x_i\) to nearest centroid: \(c_i = \operatorname*{arg\,min}_k \|x_i - \mu_k\|\)

  2. Update: Recompute centroids: \(\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i\)

  3. Repeat until convergence

Connection to EM: k-means is hard-EM for GMM with spherical covariances (\(\Sigma_k = \sigma^2 I\)) in limit \(\sigma \to 0\).
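A minimal Lloyd's-algorithm sketch on two synthetic 1-D blobs (naive random initialization for brevity; data and settings are illustrative):

```python
import random

random.seed(0)

# Two well-separated 1-D blobs (illustrative)
points = [random.gauss(0, 0.5) for _ in range(100)] + \
         [random.gauss(5, 0.5) for _ in range(100)]

def kmeans(points, k, iters=20):
    centroids = random.sample(points, k)      # naive init (k-means++ is better)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda c: (x - centroids[c]) ** 2)
            clusters[j].append(x)
        # Update step: centroid = mean of its assigned points
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

cents = sorted(kmeans(points, 2))
print([round(c, 1) for c in cents])           # near [0, 5]
```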

Limitations:

  • Assumes spherical clusters

  • Sensitive to initialization (use k-means++)

  • Must specify \(K\) (use elbow method or silhouette score)

  • Local optima

11.2 Principal Component Analysis (PCA)

Goal: Find \(d\)-dimensional linear subspace that captures maximum variance of data \(X \in \mathbb{R}^{n \times p}\) (assume centered: \(\sum_i x_i = 0\)).

Variance Maximization: First principal component \(w_1\) solves: \[\max_{\|w\|=1} w^\top S w\] where \(S = \frac{1}{n} X^\top X\) is sample covariance matrix. Solution: \(w_1\) is leading eigenvector of \(S\).

Reconstruction View: Minimize reconstruction error: \[\min_{W \in \mathbb{R}^{p \times d}} \sum_{i=1}^n \|x_i - WW^\top x_i\|^2\] Solution: \(W\) contains top \(d\) eigenvectors of \(S\).

Algorithm (via SVD):

  1. Center data: \(X \leftarrow X - \bar{x}\)

  2. Compute SVD: \(X = U \Sigma V^\top\)

  3. Principal components: columns of \(V\) (right singular vectors)

  4. Projections: \(Z = XV\) (scores)

  5. Variance explained: \(\sigma_i^2 / \sum_j \sigma_j^2\)
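The first principal component can also be found without a full SVD: a pure-Python power-iteration sketch on the \(2 \times 2\) sample covariance, using synthetic data stretched along the \((1,1)\) direction (all settings illustrative):

```python
import math, random

random.seed(0)

# 2-D points stretched along the (1,1)/sqrt(2) direction (illustrative)
pts = []
for _ in range(500):
    t, s = random.gauss(0, 3), random.gauss(0, 0.5)
    pts.append(((t + s) / math.sqrt(2), (t - s) / math.sqrt(2)))

# Step 1: center the data
n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
c = [(x - mx, y - my) for x, y in pts]

# Step 2: sample covariance matrix S (2x2)
sxx = sum(x * x for x, _ in c) / n
syy = sum(y * y for _, y in c) / n
sxy = sum(x * y for x, y in c) / n

# Step 3: power iteration converges to the leading eigenvector of S
w = (1.0, 0.0)
for _ in range(100):
    wx, wy = sxx * w[0] + sxy * w[1], sxy * w[0] + syy * w[1]
    norm = math.hypot(wx, wy)
    w = (wx / norm, wy / norm)

print(round(abs(w[0]), 2), round(abs(w[1]), 2))   # near (0.71, 0.71)
```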

Probabilistic PCA (PPCA):

Generative model: \[\begin{align*} z & \sim \mathcal{N}(0, I_d) \\ x \mid z & \sim \mathcal{N}(Wz + \mu, \sigma^2 I) \end{align*}\]

MLE recovers standard PCA in limit \(\sigma \to 0\). Advantage: Principled handling of missing data.

11.3 Independent Component Analysis (ICA)

Problem: Blind source separation. Given mixtures \(x = As\) where \(s\) are independent sources, recover \(A\) and \(s\).

Assumption: Sources \(s_i\) are non-Gaussian and independent. (If Gaussian, ICA reduces to PCA.)

Approach: Find unmixing matrix \(W\) such that \(y = Wx\) has maximally non-Gaussian components.

Non-Gaussianity Measures:

  • Kurtosis: \(\text{kurt}(y) = \mathbb{E}(y^4) - 3(\mathbb{E}(y^2))^2\) (zero for Gaussian)

  • Negentropy: \(J(y) = H(y_{\text{Gauss}}) - H(y)\), where \(y_{\text{Gauss}}\) is Gaussian with the same variance as \(y\) (differential entropies); \(J \ge 0\), with equality iff \(y\) is Gaussian

ICA vs PCA:

  • PCA: Orthogonal, maximizes variance (second-order statistics)

  • ICA: Non-orthogonal, maximizes independence (higher-order statistics)

  • Use ICA when sources are truly independent (cocktail party, EEG)

Note

Interview Question: When would PCA fail but ICA succeed?

Answer: When sources are independent and non-Gaussian (e.g., uniform distributions). PCA uses only covariance (second moments): it can decorrelate the data but cannot resolve the remaining rotational ambiguity. ICA uses higher-order moments to detect the non-Gaussian structure and recover the sources.

12 Kernel Methods & Support Vector Machines

12.1 The Kernel Trick

Many algorithms depend on data only through inner products \(x_i^\top x_j\). Kernel trick: Replace with \(k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)\) where \(\phi\) maps to high-dimensional feature space–without computing \(\phi\) explicitly.

Popular Kernels:

  • Linear: \(k(x, x') = x^\top x'\)

  • Polynomial: \(k(x, x') = (x^\top x' + c)^d\)

  • RBF (Gaussian): \(k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)\)

  • Laplacian: \(k(x, x') = \exp(-\gamma \|x - x'\|_1)\)

Valid Kernel: Must be positive semi-definite (Mercer’s theorem).
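As a sanity check on the kernel trick, the degree-2 polynomial kernel equals an inner product under an explicit 6-dimensional feature map \(\phi\) (map written out for 2-D inputs; the test points are illustrative):

```python
import math

def poly_kernel(x, y, c=1.0):
    # (x . y + c)^2 computed directly, without any feature map
    return (sum(a * b for a, b in zip(x, y)) + c) ** 2

def phi(x, c=1.0):
    # Explicit feature map for (x . y + c)^2 with 2-D inputs
    x1, x2 = x
    return [x1 * x1, x2 * x2,
            math.sqrt(2) * x1 * x2,
            math.sqrt(2 * c) * x1, math.sqrt(2 * c) * x2,
            c]

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = poly_kernel(x, y)                              # kernel value
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))     # inner product in feature space
print(lhs, round(rhs, 6))                            # both equal 4.0
```

Since \(k(x, y) = \phi(x)^\top \phi(y)\), the Gram matrix is automatically positive semi-definite, consistent with Mercer's theorem.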

12.2 Support Vector Machines (SVM)

Geometric Intuition:

Binary classification as geometry: Given labeled points \((x_i, y_i)\) with \(y_i \in \{-1, +1\}\), you want a hyperplane \(w^\top x + b = 0\) that separates the two classes.

Key insight: Many separating planes exist–SVM picks the one that maximizes the margin (distance to nearest points).

Figure: maximum-margin separating hyperplane; the closest points (support vectors, circled) lie on the margin boundaries \(w^\top x + b = \pm 1\).

Margin mechanics:

  • For a point \(x\), signed distance to hyperplane: \(\frac{w^\top x + b}{\|w\|}\)

  • SVM enforces: \(y_i(w^\top x_i + b) \ge 1\) (all points at least distance \(\frac{1}{\|w\|}\) from boundary)

  • Closest points lie on \(w^\top x + b = \pm 1\) (the margin slab)

  • Margin width: \(\frac{2}{\|w\|}\) \(\Rightarrow\) maximize margin by minimizing \(\|w\|\)

  • Only points on margin boundaries (circled) are support vectors–they define the solution

Hard-Margin SVM (linearly separable): \[\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1, \, \forall i\]

Geometric margin: \(\frac{1}{\|w\|}\). Maximizing margin \(\Leftrightarrow\) minimizing \(\|w\|^2\).

Dual Formulation (Lagrangian): \[\max_{\alpha \ge 0} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad \text{s.t.} \quad \sum_{i=1}^n \alpha_i y_i = 0\]

Solution: \(w = \sum_{i=1}^n \alpha_i y_i x_i\) (only support vectors with \(\alpha_i > 0\) matter).

Prediction: \(f(x) = \text{sign}\left(\sum_{i \in SV} \alpha_i y_i x_i^\top x + b\right)\)

Soft-Margin SVM (non-separable): Introduce slack variables \(\xi_i \ge 0\): \[\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i\]

\(C\) controls trade-off: large \(C\) → small margin, few errors (overfit); small \(C\) → large margin, more errors (underfit).

Kernel SVM: Replace \(x_i^\top x_j\) with \(k(x_i, x_j)\) in dual: \[f(x) = \text{sign}\left(\sum_{i \in SV} \alpha_i y_i k(x_i, x) + b\right)\]

Note

Why SVM?

  • Effective in high dimensions (\(d \gg n\))

  • Memory efficient (only stores support vectors)

  • Versatile (different kernels for non-linear boundaries)

  • Convex optimization → global optimum

Downside: Slow on large datasets (\(O(n^3)\) training), doesn’t scale to millions of samples like neural nets.

Practical Applications:

  • Activity recognition: Walking mode classification (walking/running/stairs) from accelerometer/gyroscope sensor data. RBF kernel handles non-linear decision boundaries between activities.

  • Text classification: Spam detection, sentiment analysis (high-dimensional sparse features from TF-IDF)

  • Bioinformatics: Protein classification, gene expression analysis (small \(n\), high \(d\))

  • Image recognition: Face detection (HOG features + SVM was state-of-the-art pre-deep learning)

13 Tree-Based Methods

13.1 Decision Trees

Core Idea: Recursively partition feature space into axis-aligned rectangular regions, assigning constant predictions to each region.

Recursive Splitting Algorithm:

  1. Start with all data at root node

  2. For each node:

    • Try all features \(j = 1, \ldots, p\)

    • For each feature, try all split points \(s\) (e.g., \(x_j < s\) vs \(x_j \ge s\))

    • Choose \((j^*, s^*)\) that maximizes information gain (classification) or minimizes MSE (regression)

    • Split data into left/right children based on \(x_{j^*} < s^*\)

  3. Recurse on children until stopping criterion (max depth, min samples, purity threshold)

  4. Assign leaf prediction: majority class (classification) or mean \(y\) (regression)

Example Split Search (Classification):

  • Node has 100 samples: 60 class A, 40 class B → Entropy \(H = -0.6\log_2(0.6) - 0.4\log_2(0.4) = 0.971\) bits

  • Try split: "Age \(< 30\)" → Left: 70 samples (50A, 20B), Right: 30 samples (10A, 20B)

  • Left entropy: \(H_L = -\frac{50}{70}\log_2\frac{50}{70} - \frac{20}{70}\log_2\frac{20}{70} = 0.863\)

  • Right entropy: \(H_R = -\frac{10}{30}\log_2\frac{10}{30} - \frac{20}{30}\log_2\frac{20}{30} = 0.918\)

  • Weighted child entropy: \(\frac{70}{100}(0.863) + \frac{30}{100}(0.918) = 0.880\)

  • Information gain: \(0.971 - 0.880 = 0.091\) → Keep this split if best among all candidates

Splitting Criteria:

Entropy (classification): \[H(p) = -\sum_{k=1}^K p_k \log p_k\]

Gini impurity: \[G(p) = \sum_{k=1}^K p_k(1 - p_k) = 1 - \sum_{k=1}^K p_k^2\]

Information gain: Reduction in entropy after split: \[IG = H(\text{parent}) - \sum_{\text{child}} \frac{|C|}{|P|} H(\text{child})\]

Regression Trees: Minimize MSE instead: \[\min_{j, s} \left[\min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2\right]\] where \(j\) is feature index, \(s\) is split threshold, \(R_1, R_2\) are resulting regions, \(c_1, c_2\) are region predictions (mean \(y\)).

Limitations:

  • High variance (small data change → different tree)

  • Greedy splitting (locally optimal)

  • Prone to overfitting (need pruning or ensembles)

13.2 Bagging & Random Forests

Bagging (Bootstrap Aggregating):

  1. Generate \(B\) bootstrap samples (sample \(n\) points with replacement)

  2. Train model \(f_b\) on each bootstrap sample

  3. Aggregate: \(\hat{f}(x) = \frac{1}{B} \sum_{b=1}^B f_b(x)\) (regression) or majority vote (classification)

Effect: Reduces variance without increasing bias.

Random Forest = Bagging + feature randomness:

  • At each split, consider only random subset of \(m\) features (typically \(m \approx \sqrt{p}\))

  • Decorrelates trees → further variance reduction

  • Out-of-bag (OOB) error: Free cross-validation estimate

Why Random Forests Work:

  • Averaging \(B\) trees with variance \(\sigma^2\) and correlation \(\rho\): \[\text{Var}(\bar{f}) = \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2\]

  • Feature randomness reduces \(\rho\) → lower variance even with correlated features

13.3 Boosting

Boosting: Sequentially train weak learners, each correcting errors of previous ones.

AdaBoost (Adaptive Boosting):

  1. Initialize weights: \(w_i = \frac{1}{n}\)

  2. For \(t = 1, \ldots, T\):

    • Train weak learner \(h_t\) on weighted data

    • Compute error: \(\epsilon_t = \sum_{i: h_t(x_i) \ne y_i} w_i\)

    • Compute learner weight: \(\alpha_t = \frac{1}{2} \log\frac{1-\epsilon_t}{\epsilon_t}\)

    • Update weights: \(w_i \leftarrow w_i \exp(-\alpha_t y_i h_t(x_i))\), renormalize

  3. Final model: \(H(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)\)

Gradient Boosting:

View boosting as gradient descent in function space. At iteration \(t\): \[f_t(x) = f_{t-1}(x) + \eta h_t(x)\] where \(h_t\) fits the negative gradient (residuals) of loss w.r.t. \(f_{t-1}\).

For squared loss: \(h_t\) fits residuals \(r_i = y_i - f_{t-1}(x_i)\).
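A gradient-boosting sketch with decision stumps under squared loss, fitting residuals exactly as described above (synthetic data and hyperparameters are illustrative):

```python
import random

random.seed(0)

# Synthetic 1-D regression data: y = x^2 + noise (illustrative)
xs = [random.uniform(-2, 2) for _ in range(120)]
ys = [x * x + random.gauss(0, 0.1) for x in xs]

def fit_stump(xs, residuals):
    """Best single-split regression tree (stump) minimizing squared error."""
    best = None
    for s in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x < s]
        right = [r for x, r in zip(xs, residuals) if x >= s]
        if not left or not right:
            continue
        cl, cr = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - cl) ** 2 for r in left) + sum((r - cr) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, s, cl, cr)
    _, s, cl, cr = best
    return lambda x, s=s, cl=cl, cr=cr: cl if x < s else cr

eta, stumps = 0.5, []
pred = [0.0] * len(xs)
for _ in range(60):
    # Residuals are the negative gradient of squared loss w.r.t. predictions
    residuals = [y - p for y, p in zip(ys, pred)]
    h = fit_stump(xs, residuals)
    stumps.append(h)
    pred = [p + eta * h(x) for p, x in zip(pred, xs)]

train_mse = sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys)
print(round(train_mse, 3))    # small: boosting drives down training error
```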

Modern Implementations:

  • XGBoost: Regularized gradient boosting, parallel tree building

  • LightGBM: Histogram-based, faster on large datasets

  • CatBoost: Handles categorical features natively

Note

Bagging vs Boosting:

  • Bagging: Parallel training, reduces variance, less prone to overfitting

  • Boosting: Sequential training, reduces bias, can overfit if not regularized

  • Random Forest: Use when features are noisy/correlated

  • Gradient Boosting: Use for maximum accuracy on clean data (Kaggle winner)

14 Missing Data & Imputation

Problem: Real-world datasets often have missing values. How to handle them without biasing estimates or losing information?

14.1 Types of Missing Data

Understanding the missingness mechanism is crucial for valid imputation:

  • Missing Completely At Random (MCAR): Missingness independent of observed and unobserved data

    • Example: Survey data lost due to random server crash

    • Safe to delete or use simple imputation

    • Rare in practice

  • Missing At Random (MAR): Missingness depends on observed data but not unobserved data

    • Example: High-income individuals more likely to skip income question, but income can be predicted from education/occupation

    • Can impute using observed features

    • Most common assumption in practice

  • Missing Not At Random (MNAR): Missingness depends on the unobserved value itself

    • Example: People with very high income skip income question because it’s high

    • Cannot be fixed by standard imputation

    • Requires domain knowledge or specialized methods

14.2 Simple Imputation Methods

Mean/Median/Mode Imputation: \[x_{\text{missing}} \leftarrow \begin{cases} \text{mean}(x_{\text{observed}}) & \text{continuous, roughly symmetric} \\ \text{median}(x_{\text{observed}}) & \text{continuous, skewed/outliers} \\ \text{mode}(x_{\text{observed}}) & \text{categorical} \end{cases}\]

Pros: Fast, deterministic, preserves sample size

Cons: Underestimates variance, distorts correlations, ignores relationships between features

Forward/Backward Fill (for time series):

  • Forward fill: \(x_t \leftarrow x_{t-1}\) (carry last observation)

  • Backward fill: \(x_t \leftarrow x_{t+1}\) (use next observation)

  • Use when values change slowly over time

Indicator Variables:

Add binary indicator \(I_{\text{missing}}\) alongside imputed value: \[\text{Features: } [x_{\text{imputed}}, I_{\text{missing}}]\]

Allows model to learn that missingness itself may be informative (e.g., "refused to answer" signals something).
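A minimal sketch of mean imputation with a missingness indicator (toy values; `None` marks a missing entry):

```python
# Toy feature with missing values marked as None (illustrative)
income = [50.0, None, 62.0, 48.0, None, 55.0]

observed = [v for v in income if v is not None]
mean_val = sum(observed) / len(observed)           # 53.75

# Impute the mean and add a binary indicator alongside each value
features = [(v if v is not None else mean_val,
             1 if v is None else 0) for v in income]

print(features[1])   # (53.75, 1): imputed value plus "was missing" flag
```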

14.3 Model-Based Imputation

K-Nearest Neighbors (KNN) Imputation:

For each missing value in row \(i\):

  1. Find \(K\) most similar rows (using observed features only)

  2. Impute as weighted average of neighbors’ values: \[x_i^{\text{(missing)}} = \frac{\sum_{j \in \text{neighbors}} w_{ij} x_j}{\sum_{j \in \text{neighbors}} w_{ij}}\] where \(w_{ij} = \frac{1}{\|x_i - x_j\|}\) or \(w_{ij} = 1\) (uniform)

Pros: Captures local structure, no distributional assumptions

Cons: Computationally expensive for large datasets, sensitive to choice of \(K\) and distance metric
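A toy KNN-imputation sketch with uniform weights and Euclidean distance over the observed columns (all data values are illustrative):

```python
# Toy rows: (age, education, income); one income is missing (illustrative)
rows = [
    (25.0, 1.0, 40.0),
    (30.0, 2.0, 55.0),
    (28.0, 2.0, 50.0),
    (45.0, 3.0, None),
    (44.0, 3.0, 80.0),
    (46.0, 3.0, 85.0),
]

def knn_impute(rows, i, col, k=2):
    """Impute rows[i][col] as the mean of the k nearest donors (uniform weights)."""
    target = rows[i]
    def dist(r):   # Euclidean distance over the observed columns only
        return sum((a - b) ** 2
                   for j, (a, b) in enumerate(zip(r, target)) if j != col) ** 0.5
    donors = [r for j, r in enumerate(rows) if j != i and r[col] is not None]
    donors.sort(key=dist)
    return sum(r[col] for r in donors[:k]) / k

print(knn_impute(rows, 3, col=2))   # 82.5: mean income of the two nearest rows
```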

Regression Imputation:

Train regression model to predict missing feature from observed features: \[\hat{x}_{\text{missing}} = f_\theta(x_{\text{observed}})\]

Can use linear regression, decision trees, or neural networks.

Pros: Leverages feature correlations, can be very accurate

Cons: Overfits if not careful (use cross-validation), underestimates uncertainty (point estimates only)

EM-Based Imputation:

Treat missing values as latent variables:

  1. E-step: Compute \(\mathbb{E}[x_{\text{missing}} \mid x_{\text{observed}}, \theta]\) under current model

  2. M-step: Update model parameters \(\theta\) using completed data

  3. Iterate until convergence

Example: For multivariate Gaussian data, EM recovers maximum likelihood estimates even with missing values.

Multiple Imputation:

Instead of single imputed value, generate \(M\) plausible values (e.g., \(M=5\)):

  1. Impute \(M\) complete datasets (e.g., via EM with different initializations)

  2. Train model on each dataset separately

  3. Pool results using Rubin’s rules (combines estimates and uncertainties)

Advantage: Properly accounts for imputation uncertainty (variance includes between-imputation variability).

14.4 Practical Decision Framework

Scenario → Method (Rationale):

  • \(<\) 5% missing, MCAR → Delete rows (minimal bias, simplest)

  • Categorical, few categories → Mode imputation (preserves distribution)

  • Continuous, high correlation with other features → Regression / KNN (leverages relationships)

  • Time series → Forward/backward fill (temporal smoothness)

  • MAR, need uncertainty quantification → Multiple imputation (accounts for imputation variance)

  • MNAR → Domain-specific model or sensitivity analysis (standard methods fail)

Impact on Model Performance:

  • Bias: Simple methods (mean) can bias estimates if not MCAR

  • Variance: Single imputation underestimates variance (treats imputed values as certain)

  • Power: Deletion reduces sample size \(\to\) less statistical power

Note

Interview Question: "You have a dataset with 20% missing values in the ‘income’ column. Missingness correlates with education level. How would you handle this?"

Strong Answer:

  • Identify mechanism: This is MAR (missingness depends on observed education, not unobserved income itself)

  • Avoid deletion: Would lose 20% of data and bias results

  • Avoid simple mean imputation: Ignores education-income correlation

  • Recommended approach:

    • Use regression imputation with education (and other features) as predictors

    • Or KNN imputation to find similar individuals by education

    • Add missingness indicator if "refusal to answer" might be informative

    • Validate on holdout set with complete data if available

  • Advanced: Use multiple imputation if inference/confidence intervals are needed

15 Interview Questions & Key Concepts

15.1 Common Interview Questions

  1. Q: Explain bias-variance trade-off.

    A: Bias measures how far model’s average prediction is from truth (underfitting). Variance measures prediction variability across training sets (overfitting). Total error = Bias\(^2\) + Variance + Noise. Regularization increases bias slightly to reduce variance significantly.

  2. Q: When would you use L1 vs L2 regularization?

    A: L1 (Lasso) produces sparse solutions–use for feature selection or when many features are irrelevant. L2 (Ridge) shrinks all coefficients–use when all features are relevant or with multicollinearity. Elastic Net combines both.

  3. Q: What is the kernel trick?

    A: Replace inner products \(x_i^\top x_j\) with kernel \(k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)\) to implicitly work in high-dimensional feature space without computing \(\phi\) explicitly. Enables non-linear decision boundaries with linear algorithms.

  4. Q: Explain the EM algorithm.

    A: Iterative method for MLE with latent variables. E-step: Compute expected complete-data log-likelihood given current parameters. M-step: Maximize this expectation w.r.t. parameters. Guaranteed never to decrease the likelihood at any iteration (converges to a local optimum).

  5. Q: PCA vs ICA–when to use each?

    A: PCA finds orthogonal directions of maximum variance (decorrelates data). ICA finds statistically independent components (non-Gaussian). Use PCA for dimensionality reduction / noise removal. Use ICA for blind source separation (cocktail party problem).

  6. Q: Why does Random Forest reduce variance?

    A: Averaging reduces variance: \(\text{Var}(\bar{X}) = \frac{\sigma^2}{n}\) for independent \(X_i\). Trees are correlated, so RF adds feature randomness to decorrelate. Result: lower variance than single tree without increasing bias.

  7. Q: Bagging vs Boosting?

    A: Bagging trains models in parallel on bootstrap samples, reduces variance (Random Forest). Boosting trains sequentially, each model correcting the previous models' errors, reduces bias (Gradient Boosting). Boosting is often more accurate but more prone to overfitting.

  8. Q: What are conjugate priors and why use them?

    A: Prior is conjugate to likelihood if posterior has same functional form as prior (e.g., Beta-Bernoulli). Simplifies Bayesian inference–closed-form posterior updates, interpretable hyperparameters as pseudo-counts.

  9. Q: Derive logistic regression from exponential family.

    A: Bernoulli is exponential family with natural parameter \(\eta = \log\frac{p}{1-p}\) (log-odds). GLM framework: \(\eta = X\theta\). Invert: \(p = \frac{1}{1+e^{-X\theta}} = \sigma(X\theta)\). Logit is canonical link for Bernoulli.

  10. Q: How does SVM handle non-linear boundaries?

    A: Via kernel trick. RBF kernel \(k(x,x') = \exp(-\gamma\|x-x'\|^2)\) maps to infinite-dimensional space where data becomes linearly separable. SVM finds maximum-margin hyperplane in this space.
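The kernel-trick identity behind Q3 and Q10 can be checked numerically for the degree-2 homogeneous polynomial kernel, whose explicit feature map is small enough to write out (the map \(\phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)\) for 2-D inputs is standard):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x.y)^2."""
    return (x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

lhs = k(x, y)           # kernel evaluated directly in input space
rhs = phi(x) @ phi(y)   # inner product in the explicit feature space
```

The two values agree exactly; the kernel evaluates the 3-D feature-space inner product using only a 2-D dot product. For the RBF kernel the same identity holds with an infinite-dimensional \(\phi\) that can never be written out, which is the whole point of the trick.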

15.2 Key Takeaways for Interviews

Note

Must-Know Concepts:

  • ERM Framework: Empirical risk minimization, bias-variance decomposition

  • Regularization: L1 (sparsity), L2 (shrinkage), when to use each

  • Exponential Families: Gaussian, Bernoulli, Poisson; canonical links

  • Conjugate Priors: Beta-Bernoulli, Dirichlet-Categorical, Gamma-Poisson

  • EM Algorithm: E-step (posterior), M-step (parameter update), ELBO

  • PCA: Eigendecomposition of covariance, variance explained

  • SVM: Maximum margin, kernel trick, support vectors

  • Random Forest: Bagging + feature randomness, variance reduction

  • Gradient Boosting: Sequential error correction, functional gradient descent
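The variance-reduction claim behind bagging and Random Forests (Var of an average is \(\sigma^2/n\) for independent predictors, but floors at \(\rho\sigma^2\) under pairwise correlation \(\rho\)) can be checked by simulation; the numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n_models = 2.0, 25

# Independent "model predictions", each with variance sigma^2
preds = rng.normal(0, sigma, size=(100_000, n_models))

var_single = preds[:, 0].var()       # ~ sigma^2 = 4
var_avg = preds.mean(axis=1).var()   # ~ sigma^2 / n = 0.16

# Correlated predictions (pairwise correlation rho) average to a higher floor
rho = 0.5
shared = rng.normal(0, sigma, size=(100_000, 1))
corr = np.sqrt(rho) * shared + np.sqrt(1 - rho) * preds
var_corr_avg = corr.mean(axis=1).var()  # ~ rho*sigma^2 + (1-rho)*sigma^2/n = 2.08
```

The correlated case shows why Random Forest injects feature randomness: lowering \(\rho\) between trees lowers the variance floor that averaging alone cannot break.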

Note

Common Pitfalls:

  • Confusing L1/L2 regularization effects (L1 → sparse, L2 → smooth)

  • Forgetting to standardize features for distance-based methods (k-means, SVM with RBF)

  • Using PCA when features have different units (standardize first!)

  • Thinking EM finds global optimum (only local–needs good initialization)

  • Forgetting soft-margin SVM parameter \(C\) (larger \(C\) → less regularization)
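The EM pitfall above (local optima, monotone but initialization-dependent convergence) can be illustrated with a compact EM loop for a two-component 1-D Gaussian mixture; the data-generating parameters here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 300)])

# Initial parameters of a 2-component 1-D Gaussian mixture
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def log_lik(x, pi, mu, var):
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum()

lls = []
for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted MLE updates
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    lls.append(log_lik(x, pi, mu, var))
```

The log-likelihood trace is non-decreasing at every iteration, as the theory guarantees; re-running with a poor initialization (e.g. both means started in the same cluster) is the easiest way to see the local-optimum caveat in action.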

16 Quick Reference

16.1 Optimization Basics

Gradient Descent: \(\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)\)

Stochastic GD: Use a single-sample (or mini-batch) gradient estimate (noisy, but far cheaper per step)

Momentum: \(v_{t+1} = \beta v_t + \nabla L(\theta_t)\), \(\theta_{t+1} = \theta_t - \eta v_{t+1}\)

Adam: Adaptive learning rates + momentum (first and second moment estimates)
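The plain GD and momentum update rules above can be compared on an ill-conditioned quadratic \(L(\theta) = \tfrac{1}{2}\theta^\top A\theta\); the matrix \(A\), step size \(\eta\), and \(\beta\) below are illustrative choices:

```python
import numpy as np

# Ill-conditioned quadratic: L(theta) = 0.5 * theta^T A theta, gradient A theta
A = np.diag([1.0, 50.0])
grad = lambda th: A @ th

theta_gd = np.array([1.0, 1.0])                   # plain gradient descent iterate
theta_mom, v = np.array([1.0, 1.0]), np.zeros(2)  # momentum iterate and velocity
eta, beta = 0.01, 0.9

for _ in range(300):
    theta_gd = theta_gd - eta * grad(theta_gd)
    v = beta * v + grad(theta_mom)                # v_{t+1} = beta v_t + grad L(theta_t)
    theta_mom = theta_mom - eta * v               # theta_{t+1} = theta_t - eta v_{t+1}
```

With the step size capped by the large eigenvalue, plain GD crawls along the shallow direction, while momentum accumulates gradient along it and reaches the minimum much faster.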

16.2 Convexity

Convex function: \(f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y)\)

Strongly convex: \(f(y) \ge f(x) + \nabla f(x)^\top(y-x) + \frac{\mu}{2}\|y-x\|^2\)

Why care: For a convex objective every local minimum is global (GD converges to one); strong convexity additionally makes the minimizer unique. Non-convex objectives (deep learning) can have many local minima and saddle points.

16.3 Key Inequalities

Jensen’s: For convex \(f\), \(f(\mathbb{E}(X)) \le \mathbb{E}(f(X))\)

Cauchy-Schwarz: \(|x^\top y| \le \|x\| \|y\|\)

Hoeffding: For i.i.d. \(X_i \in [a, b]\), \(P(|\bar{X}_n - \mathbb{E}(\bar{X}_n)| \ge t) \le 2\exp\left(-\frac{2nt^2}{(b-a)^2}\right)\)
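The first two inequalities above are easy to sanity-check numerically (distributions and vector sizes here are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)

# Jensen with convex f(x) = x^2: f(E[X]) <= E[f(X)]
x = rng.exponential(1.0, 100_000)
jensen_lhs = x.mean() ** 2      # f(E[X])
jensen_rhs = (x ** 2).mean()    # E[f(X)]; the gap is exactly the sample variance

# Cauchy-Schwarz: |x^T y| <= ||x|| ||y||
u, v = rng.normal(size=50), rng.normal(size=50)
cs_lhs = abs(u @ v)
cs_rhs = np.linalg.norm(u) * np.linalg.norm(v)
```

For \(f(x) = x^2\) the Jensen gap \(\mathbb{E}(X^2) - \mathbb{E}(X)^2\) is precisely \(\text{Var}(X)\), which ties the inequality back to the bias-variance decomposition earlier in the chapter.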

17 References & Further Reading

  • Bishop (2006): Pattern Recognition and Machine Learning–comprehensive ML textbook

  • Murphy (2012): Machine Learning: A Probabilistic Perspective–Bayesian approach

  • Hastie et al. (2009): Elements of Statistical Learning–statistical perspective, free online

  • James et al. (2013): An Introduction to Statistical Learning–accessible intro with R code

  • Deisenroth et al. (2020): Mathematics for Machine Learning–linear algebra, probability, optimization


  1. Leemis, L.M. & McQueston, J.T. (2008). Univariate Distribution Relationships. The American Statistician, 62(1), 45-53.