3  Chapter 2: Sequential & State-Space Models

4 Introduction: Modeling Sequential Data

Sequential data exhibits temporal or spatial dependencies: \(x_1, x_2, \ldots, x_T\) where order matters.

Key Challenges:

  • Variable length: Sequences can be arbitrarily long

  • Dependencies: Current state depends on history

  • Latent dynamics: Underlying state may be hidden (unobserved)

This Document Covers:

  • Hidden Markov Models (HMMs): Discrete latent states, discrete/continuous observations

  • Kalman Filters: Continuous state-space models with Gaussian distributions

  • Particle Filters: Non-parametric sequential Monte Carlo methods

  • Markov Decision Processes (MDPs): Sequential decision-making framework

  • Reinforcement Learning: Q-learning, policy gradients, value functions

5 Hidden Markov Models (HMMs)

5.1 Model Definition

An HMM models sequences with hidden (latent) states \(z_t\) and observed outputs \(x_t\).

Assumptions:

  1. Markov property: \(p(z_t \mid z_{1:t-1}) = p(z_t \mid z_{t-1})\) (current state depends only on previous)

  2. Observation independence: \(p(x_t \mid z_{1:T}, x_{1:t-1}) = p(x_t \mid z_t)\) (observation depends only on current state)

Model Parameters:

  • Initial distribution: \(\pi_i = p(z_1 = i)\)

  • Transition probabilities: \(A_{ij} = p(z_t = j \mid z_{t-1} = i)\)

  • Emission probabilities: \(B_j(x) = p(x_t = x \mid z_t = j)\)

Joint Distribution: \[p(x_{1:T}, z_{1:T}) = p(z_1) p(x_1 \mid z_1) \prod_{t=2}^T p(z_t \mid z_{t-1}) p(x_t \mid z_t)\]

Graphical Model:

(Figure: the HMM chain \(z_1 \to z_2 \to \cdots \to z_T\), with an emission arrow \(z_t \to x_t\) at each step.)

5.2 The Three Fundamental Problems

Problem 1: Evaluation (Filtering/Smoothing)

Given observations \(x_{1:T}\) and model parameters, compute:

  • Filtering: \(p(z_t \mid x_{1:t})\) (current state given past observations)

  • Smoothing: \(p(z_t \mid x_{1:T})\) (state at time \(t\) given all observations)

  • Prediction: \(p(z_{t+k} \mid x_{1:t})\) (future state)

  • Likelihood: \(p(x_{1:T})\) (probability of observation sequence)

Solved by Forward-Backward Algorithm.

Problem 2: Decoding

Find most likely state sequence: \[z_{1:T}^* = \operatorname*{arg\,max}_{z_{1:T}} p(z_{1:T} \mid x_{1:T})\]

Solved by Viterbi Algorithm.

Problem 3: Learning

Given observations \(x_{1:T}\), learn parameters \(\theta = (\pi, A, B)\): \[\theta^* = \operatorname*{arg\,max}_\theta p(x_{1:T} \mid \theta)\]

Solved by Baum-Welch Algorithm (special case of EM).

5.3 Forward-Backward Algorithm

Forward Pass (Filtering):

Define forward variable: \(\alpha_t(i) = p(x_{1:t}, z_t = i)\)

Recursion: \[\begin{align*} \alpha_1(i) & = \pi_i B_i(x_1) \\ \alpha_t(j) & = B_j(x_t) \sum_{i=1}^K \alpha_{t-1}(i) A_{ij} \end{align*}\]

Likelihood: \(p(x_{1:T}) = \sum_{i=1}^K \alpha_T(i)\)

Backward Pass:

Define backward variable: \(\beta_t(i) = p(x_{t+1:T} \mid z_t = i)\)

Recursion: \[\begin{align*} \beta_T(i) & = 1 \\ \beta_t(i) & = \sum_{j=1}^K A_{ij} B_j(x_{t+1}) \beta_{t+1}(j) \end{align*}\]

Smoothing: \[p(z_t = i \mid x_{1:T}) = \frac{\alpha_t(i) \beta_t(i)}{\sum_{j=1}^K \alpha_t(j) \beta_t(j)}\]

Complexity: \(O(K^2 T)\) (vs naive \(O(K^T)\))
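The two recursions above can be sketched in a few lines of NumPy. This is a minimal, unscaled version for illustration; for long sequences a real implementation normalizes \(\alpha_t\) at each step (or works in log space) to avoid underflow:

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """pi: (K,) initial dist; A: (K,K) with A[i,j] = p(z_t=j | z_{t-1}=i);
    B: (K,M) with B[j,x] = p(x | z=j); obs: list of observation indices."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]                       # base case
    for t in range(1, T):                              # forward recursion
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)
    beta[-1] = 1.0                                     # base case
    for t in range(T - 2, -1, -1):                     # backward recursion
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    lik = alpha[-1].sum()                              # p(x_{1:T})
    gamma = alpha * beta / lik                         # smoothed marginals
    return alpha, beta, lik, gamma

# Toy 2-state HMM with binary observations
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
alpha, beta, lik, gamma = forward_backward(pi, A, B, [0, 1, 0])
```

A useful sanity check: \(\sum_i \alpha_t(i)\beta_t(i) = p(x_{1:T})\) must hold at every \(t\).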

5.4 Viterbi Algorithm

Find most likely path using dynamic programming.

Define: \(\delta_t(i) = \max_{z_{1:t-1}} p(z_{1:t-1}, z_t = i, x_{1:t})\)

Recursion: \[\begin{align*} \delta_1(i) & = \pi_i B_i(x_1) \\ \delta_t(j) & = B_j(x_t) \max_{i} \delta_{t-1}(i) A_{ij} \end{align*}\]

Store backpointers: \(\psi_t(j) = \operatorname*{arg\,max}_i \delta_{t-1}(i) A_{ij}\)

Backtrack: \[\begin{align*} z_T^* & = \operatorname*{arg\,max}_i \delta_T(i) \\ z_t^* & = \psi_{t+1}(z_{t+1}^*) \quad \text{for } t = T-1, \ldots, 1 \end{align*}\]
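The recursion and backtracking above translate directly into code. A compact log-space sketch (assumes strictly positive \(\pi\), \(A\), \(B\) so the logs stay finite):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state path via dynamic programming (log-space for stability)."""
    T, K = len(obs), len(pi)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA      # scores[i, j] = delta_{t-1}(i) + log A_ij
        psi[t] = scores.argmax(axis=0)             # backpointers
        delta[t] = logB[:, obs[t]] + scores.max(axis=0)
    path = [int(delta[-1].argmax())]               # best final state
    for t in range(T - 1, 0, -1):                  # backtrack
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

path = viterbi(np.array([0.6, 0.4]),
               np.array([[0.7, 0.3], [0.4, 0.6]]),
               np.array([[0.9, 0.1], [0.2, 0.8]]),
               [0, 1, 0])
```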

5.5 Baum-Welch Algorithm (EM for HMMs)

E-step: Compute expected sufficient statistics using forward-backward:

\[\gamma_t(i) = p(z_t = i \mid x_{1:T})\] \[\xi_t(i,j) = p(z_t = i, z_{t+1} = j \mid x_{1:T}) = \frac{\alpha_t(i) A_{ij} B_j(x_{t+1}) \beta_{t+1}(j)}{p(x_{1:T})}\]

M-step: Update parameters: \[\begin{align*} \pi_i & = \gamma_1(i) \\ A_{ij} & = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \\ B_j(x_k) & = \frac{\sum_{t: x_t = x_k} \gamma_t(j)}{\sum_{t=1}^T \gamma_t(j)} \end{align*}\]

Example

HMM Applications:

  • Speech recognition: States = phonemes, observations = acoustic features

  • Part-of-speech tagging: States = POS tags, observations = words

  • Bioinformatics: States = gene structure (exon/intron), observations = DNA sequence

  • Finance: States = market regimes (bull/bear), observations = returns

6 Kalman Filters

6.1 Linear Gaussian State-Space Model

The Kalman filter is the optimal Bayesian filter for linear-Gaussian systems.

State-Space Model: \[\begin{align} z_t & = A z_{t-1} + B u_t + w_t, \quad w_t \sim \mathcal{N}(0, Q) \quad \text{(state transition)} \\ x_t & = C z_t + v_t, \quad v_t \sim \mathcal{N}(0, R) \quad \text{(observation)} \end{align}\]

where:

  • \(z_t \in \mathbb{R}^n\): latent state (unobserved)

  • \(x_t \in \mathbb{R}^m\): observation (measured)

  • \(u_t \in \mathbb{R}^p\): control input (optional)

  • \(A\): state transition matrix

  • \(B\): control input matrix

  • \(C\): observation matrix

  • \(Q\): process noise covariance

  • \(R\): observation noise covariance

Key Insight: If prior is Gaussian and dynamics are linear-Gaussian, posterior remains Gaussian → closed-form updates!

6.2 Kalman Filter Recursion

Maintain Gaussian belief: \(p(z_t \mid x_{1:t}) = \mathcal{N}(z_t \mid \mu_t, \Sigma_t)\)

Predict Step (Time Update): \[\begin{align*} \bar{\mu}_t & = A \mu_{t-1} + B u_t \quad \text{(predicted state mean)} \\ \bar{\Sigma}_t & = A \Sigma_{t-1} A^\top + Q \quad \text{(predicted covariance)} \end{align*}\]

Update Step (Measurement Update): \[\begin{align*} K_t & = \bar{\Sigma}_t C^\top (C \bar{\Sigma}_t C^\top + R)^{-1} \quad (\text{Kalman gain} = \Sigma_{ZX}\Sigma_{XX}^{-1}) \\ \mu_t & = \bar{\mu}_t + K_t (x_t - C \bar{\mu}_t) \quad \text{(corrected mean)} \\ \Sigma_t & = (I - K_t C) \bar{\Sigma}_t \quad (\text{corrected covariance} = \Sigma_{ZZ} - \Sigma_{ZX}\Sigma_{XX}^{-1}\Sigma_{XZ}) \end{align*}\]

Innovation: \(y_t = x_t - C \bar{\mu}_t\) (prediction error)

Kalman Gain Interpretation:

  • \(K_t \approx 0\): High observation noise (\(R\) large) → trust prediction

  • \(K_t \approx C^{-1}\) (pseudo-inverse if \(C\) is not square): Low observation noise (\(R\) small) → trust measurement
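One predict–update cycle is a direct transcription of the equations above. A minimal NumPy sketch (a production filter would prefer the Joseph-form covariance update noted below and a linear solve instead of an explicit inverse):

```python
import numpy as np

def kalman_step(mu, Sigma, x, A, C, Q, R, B=None, u=None):
    """One predict + update cycle of the Kalman filter."""
    # Predict (time update)
    mu_bar = A @ mu + (B @ u if B is not None else 0.0)
    Sigma_bar = A @ Sigma @ A.T + Q
    # Update (measurement update)
    S = C @ Sigma_bar @ C.T + R                # innovation covariance
    K = Sigma_bar @ C.T @ np.linalg.inv(S)     # Kalman gain
    mu_new = mu_bar + K @ (x - C @ mu_bar)     # correct with innovation
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_bar
    return mu_new, Sigma_new

# 1-D example: static state (A = 1, Q = 0), one noisy measurement.
# Fuses prior N(0, 1) with measurement 1 (noise var 1): posterior N(0.5, 0.5).
mu, Sigma = kalman_step(np.array([0.0]), np.eye(1), np.array([1.0]),
                        A=np.eye(1), C=np.eye(1), Q=np.zeros((1, 1)), R=np.eye(1))
```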

Note

Joseph Form:

  • Numerical stability: Floating point errors can cause \(\Sigma_t\) to lose positive definiteness.

  • Joseph form of covariance update maintains symmetry and positive definiteness: \[\Sigma_t = (I - K_t C) \bar{\Sigma}_t (I - K_t C)^\top + K_t R K_t^\top\]

6.3 Steady-State Kalman Filter and the Riccati Equation

For time-invariant systems (constant \(A, C, Q, R\)), the covariance \(\Sigma_t\) converges to a steady-state value \(\Sigma_\infty\) as \(t \to \infty\).

Discrete Algebraic Riccati Equation (DARE): \[\Sigma_\infty = A \Sigma_\infty A^\top + Q - A \Sigma_\infty C^\top (C \Sigma_\infty C^\top + R)^{-1} C \Sigma_\infty A^\top\]

This is a fixed-point equation: \(\Sigma_\infty = f(\Sigma_\infty)\) where \(f\) is the covariance update.

Practical Implications:

  • Offline computation: Solve DARE once offline (e.g., via iterative methods or control toolboxes)

  • Runtime efficiency: Use constant \(\Sigma_\infty\) and \(K_\infty = \Sigma_\infty C^\top (C \Sigma_\infty C^\top + R)^{-1}\)

  • No covariance updates: Only update state mean \(\mu_t\) at runtime (saves matrix operations)

  • Common in practice: Most deployed systems use steady-state KF for computational efficiency

When Steady-State KF Applies:

  • Time-invariant dynamics (\(A, C, Q, R\) constant)

  • System is observable and controllable (DARE solution exists and is unique)

  • Long-running applications where transient behavior is negligible (tracking, navigation)

Connection to LQG Control: The DARE also appears in Linear-Quadratic-Gaussian (LQG) optimal control: the Kalman filter is the optimal state estimator, and the DARE solution gives the steady-state estimation error covariance.
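A sketch of the offline computation: iterate the predicted-covariance Riccati map to its fixed point, then precompute \(K_\infty\). (In practice one would call a library solver such as `scipy.linalg.solve_discrete_are`; the fixed-point iteration below is the conceptually simplest method.)

```python
import numpy as np

def dare_iterate(A, C, Q, R, iters=1000):
    """Iterate the predicted-covariance Riccati map to a fixed point."""
    P = Q.copy()
    for _ in range(iters):
        S = C @ P @ C.T + R
        P = A @ P @ A.T + Q - A @ P @ C.T @ np.linalg.inv(S) @ C @ P @ A.T
    K_inf = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)   # steady-state gain
    return P, K_inf

# Scalar random walk (A = C = Q = R = 1): the DARE reduces to P^2 = P + 1,
# so the fixed point is the golden ratio.
P, K = dare_iterate(np.eye(1), np.eye(1), np.eye(1), np.eye(1))
```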

6.4 Extended Kalman Filter (EKF)

For nonlinear dynamics: \[\begin{align*} z_t & = f(z_{t-1}, u_t) + w_t \\ x_t & = h(z_t) + v_t \end{align*}\]

Linearize around current estimate via first-order Taylor expansion: \[A_t = \frac{\partial f}{\partial z}\bigg|_{z=\mu_{t-1}}, \quad C_t = \frac{\partial h}{\partial z}\bigg|_{z=\bar{\mu}_t}\]

Then apply standard Kalman filter with time-varying \(A_t, C_t\).

Limitations: EKF can fail for highly nonlinear systems (linearization error accumulates).

6.5 Unscented Kalman Filter (UKF)

Idea: Approximate Gaussian distribution by sigma points–deterministically chosen samples that capture mean and covariance.

Key Insight: In the Kalman filter, covariance evolution is independent of actual observations: \[\bar{\Sigma}_{t+1} = A \Sigma_t A^\top + Q\] This depends only on the previous covariance \(\Sigma_t\), not on \(x_t\)! To propagate uncertainty through nonlinear \(f\), we only need to track how the distribution of errors transforms.

UKF’s Solution: Sample the input distribution, propagate samples through \(f\), estimate output statistics.

  1. Generate sigma points \(\mathcal{X}^{(i)}\) around mean \(\mu\) with covariance \(\Sigma\):

    • \(\mathcal{X}^{(0)} = \mu\) (center point)

    • \(\mathcal{X}^{(i)} = \mu + \sqrt{(n+\lambda)\Sigma}_i\) for \(i=1,\ldots,n\) (spread along principal axes)

    • \(\mathcal{X}^{(i)} = \mu - \sqrt{(n+\lambda)\Sigma}_{i-n}\) for \(i=n+1,\ldots,2n\)

    where \(\sqrt{\Sigma}\) is Cholesky decomposition, \(\lambda\) is tuning parameter.

  2. Propagate each sigma point: \(\mathcal{Y}^{(i)} = f(\mathcal{X}^{(i)})\) (no linearization!)

  3. Estimate output statistics from transformed samples: \[\begin{align*} \bar{\mu} & = \sum_{i=0}^{2n} w_i^{(m)} \mathcal{Y}^{(i)} \\ \bar{\Sigma} & = \sum_{i=0}^{2n} w_i^{(c)} (\mathcal{Y}^{(i)} - \bar{\mu})(\mathcal{Y}^{(i)} - \bar{\mu})^\top \end{align*}\] where \(w_i\) are weights (chosen to match mean and covariance exactly).

  4. Apply same process to observation model \(h\) for measurement update
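Steps 1–3 above amount to the unscented transform. A minimal sketch using the simple weight choice \(w_0 = \lambda/(n+\lambda)\), \(w_i = 1/(2(n+\lambda))\) for both mean and covariance (full UKF implementations use separate \(w^{(m)}, w^{(c)}\) with extra tuning parameters):

```python
import numpy as np

def unscented_transform(mu, Sigma, f, lam=1.0):
    """Propagate N(mu, Sigma) through a nonlinear f via 2n+1 sigma points."""
    n = len(mu)
    L = np.linalg.cholesky((n + lam) * Sigma)       # matrix square root
    X = np.vstack([mu, mu + L.T, mu - L.T])         # sigma points (rows)
    w = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    w[0] = lam / (n + lam)
    Y = np.array([f(x) for x in X])                 # propagate (no linearization)
    mu_out = w @ Y                                  # weighted mean
    d = Y - mu_out
    Sigma_out = (w * d.T) @ d                       # weighted covariance
    return mu_out, Sigma_out

# Sanity check: for a linear map the unscented transform is exact.
mu_o, Sig_o = unscented_transform(np.array([1.0, 2.0]), np.diag([1.0, 0.5]),
                                  lambda x: 2 * x)
```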

Why This Works:

  • Sigma points capture 1st and 2nd moments (mean, covariance) exactly for Gaussian

  • Propagating 2n+1 points captures curvature of \(f\) better than single linearization point (EKF)

  • Accurate up to 3rd-order moments for polynomial nonlinearities

  • Extension: Could use more sophisticated sampling (e.g., cubature rules, higher-order unscented transforms) if needed

Advantages over EKF:

  • No Jacobian computation (derivative-free)

  • Better approximation for highly nonlinear functions

  • More robust to nonlinearity

  • Numerical stability (Cholesky decomposition maintains positive-definiteness)

6.6 Kalman Smoother (RTS Smoother)

Compute smoothed estimate \(p(z_t \mid x_{1:T})\) (uses all data, not just past).

Forward pass: Standard Kalman filter → \(\mu_t, \Sigma_t\)

Backward pass: Rauch-Tung-Striebel (RTS) recursion: \[\begin{align*} G_t & = \Sigma_t A^\top \bar{\Sigma}_{t+1}^{-1} \\ \mu_t^s & = \mu_t + G_t (\mu_{t+1}^s - \bar{\mu}_{t+1}) \\ \Sigma_t^s & = \Sigma_t + G_t (\Sigma_{t+1}^s - \bar{\Sigma}_{t+1}) G_t^\top \end{align*}\]

Smoother has lower uncertainty than filter: \(\Sigma_t^s \preceq \Sigma_t\) (in PSD sense).

Note

When to Use Kalman Filters:

  • Tracking: Object tracking in video, GPS/IMU sensor fusion

  • Control: Robotics, autonomous vehicles (state estimation for controller)

  • Signal processing: Noise reduction, prediction in time series

  • Economics: State-space models for GDP, inflation, etc.

Assumptions Required:

  • Linear dynamics (or use EKF/UKF for nonlinear)

  • Gaussian noise

  • Known noise covariances \(Q, R\)

Example

Classic Example: Tracking a Moving Object

State: \(z_t = [x, y, \dot{x}, \dot{y}]^\top\) (position + velocity)

Observation: \(x_t = [x_{\text{meas}}, y_{\text{meas}}]^\top\) (noisy GPS measurements)

Constant velocity model: \[A = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}\]

Kalman filter fuses noisy measurements with motion model to estimate true position/velocity.
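A quick sketch of this model in NumPy (with \(\Delta t = 1\)); these matrices plug directly into the predict/update equations of Section 6.2:

```python
import numpy as np

dt = 1.0
A = np.array([[1, 0, dt, 0],     # x position advances by x velocity * dt
              [0, 1, 0, dt],     # y position advances by y velocity * dt
              [0, 0, 1,  0],     # velocities unchanged (constant-velocity model)
              [0, 0, 0,  1]], dtype=float)
C = np.array([[1, 0, 0, 0],      # only position is measured
              [0, 1, 0, 0]], dtype=float)

z = np.array([0.0, 0.0, 1.0, 2.0])   # at the origin, velocity (1, 2)
z_next = A @ z                        # state after one time step
x_meas = C @ z_next                   # what a noise-free sensor would report
```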

7 Particle Filters (Sequential Monte Carlo)

7.1 Motivation: Beyond Gaussians

Limitation of Kalman Filters: Require (approximately) linear dynamics and a Gaussian, unimodal belief.

Particle Filter (PF) represents belief \(p(z_t \mid x_{1:t})\) by weighted samples (particles): \[p(z_t \mid x_{1:t}) \approx \sum_{i=1}^N w_t^{(i)} \delta(z_t - z_t^{(i)})\]

where \(\{z_t^{(i)}, w_t^{(i)}\}_{i=1}^N\) are particles and weights.

Advantages:

  • Handles arbitrary distributions (multimodal, non-Gaussian)

  • Nonlinear dynamics without linearization

  • Asymptotically converges to true posterior as \(N \to \infty\)

7.2 Sequential Importance Sampling (SIS)

Importance Sampling: To estimate \(\mathbb{E}(f(z))\) under \(p(z)\), sample from proposal \(q(z)\): \[\mathbb{E}(f(z)) = \int f(z) p(z) dz = \int f(z) \frac{p(z)}{q(z)} q(z) dz \approx \frac{1}{N} \sum_{i=1}^N w^{(i)} f(z^{(i)})\] where \(z^{(i)} \sim q(z)\) and \(w^{(i)} = \frac{p(z^{(i)})}{q(z^{(i)})}\) (importance weight).

Sequential Update:

At time \(t\), propose new particle: \(z_t^{(i)} \sim q(z_t \mid z_{t-1}^{(i)}, x_t)\)

Update weight: \[w_t^{(i)} \propto w_{t-1}^{(i)} \frac{p(x_t \mid z_t^{(i)}) p(z_t^{(i)} \mid z_{t-1}^{(i)})}{q(z_t^{(i)} \mid z_{t-1}^{(i)}, x_t)}\]

Normalize: \(w_t^{(i)} \leftarrow \frac{w_t^{(i)}}{\sum_j w_t^{(j)}}\)

7.3 Bootstrap Particle Filter

Simplest choice: Use transition prior as proposal: \(q(z_t \mid z_{t-1}, x_t) = p(z_t \mid z_{t-1})\)

Weight update simplifies to: \[w_t^{(i)} \propto w_{t-1}^{(i)} p(x_t \mid z_t^{(i)})\]

Algorithm:

  1. Sample \(z_t^{(i)} \sim p(z_t \mid z_{t-1}^{(i)})\) (predict)

  2. Compute \(w_t^{(i)} \propto w_{t-1}^{(i)} p(x_t \mid z_t^{(i)})\) (weight by likelihood)

  3. Normalize weights

  4. Resample if needed (see below)
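The four steps above can be sketched in NumPy for a generic 1-D model. The transition sampler and likelihood are user-supplied callables (an assumption of this example, not a standard API); multinomial resampling is used for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_pf_step(particles, weights, x_obs, sample_transition, likelihood):
    """One step of the bootstrap particle filter."""
    particles = sample_transition(particles)            # 1. predict from prior
    weights = weights * likelihood(x_obs, particles)    # 2. weight by likelihood
    weights = weights / weights.sum()                   # 3. normalize
    ess = 1.0 / np.sum(weights ** 2)                    # 4. resample if degenerate
    if ess < len(particles) / 2:
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights

# Random-walk dynamics, Gaussian likelihood; prior belief centered at 2, observe 0.
parts = rng.normal(2.0, 1.0, 1000)
w = np.full(1000, 1.0 / 1000)
parts, w = bootstrap_pf_step(parts, w, 0.0,
                             lambda p: p + rng.normal(0.0, 0.1, p.shape),
                             lambda x, p: np.exp(-0.5 * (x - p) ** 2))
```

After the update, the weighted particle mean should sit between the prior mean (2) and the observation (0), as the analytic Gaussian posterior predicts.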

7.4 Degeneracy Problem & Resampling

Problem: After a few iterations, most weights become negligible; only a few particles contribute (degeneracy).

Solution: Resampling–duplicate high-weight particles, discard low-weight ones.

Effective Sample Size (ESS): \[N_{\text{eff}} = \frac{1}{\sum_{i=1}^N (w_t^{(i)})^2}\]

If \(N_{\text{eff}} < N_{\text{thresh}}\) (e.g., \(N/2\)), resample.

Resampling Methods:

  • Multinomial: Draw \(N\) samples with replacement according to weights

  • Systematic: Lower variance, deterministic spacing

  • Residual: Hybrid approach

After resampling, reset weights: \(w_t^{(i)} = 1/N\).
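Systematic resampling is a few lines: one uniform draw \(u \in [0,1)\), then \(N\) evenly spaced positions swept through the cumulative weights. A sketch (`u` is exposed as a parameter only to make the example deterministic):

```python
import numpy as np

def systematic_resample(weights, u=None):
    """Return indices of resampled particles (low-variance, O(N log N) here)."""
    N = len(weights)
    if u is None:
        u = np.random.default_rng().random()
    positions = (u + np.arange(N)) / N                 # evenly spaced in [0, 1)
    return np.searchsorted(np.cumsum(weights), positions)

# Two particles carry all the weight: each is duplicated exactly twice.
idx = systematic_resample(np.array([0.5, 0.5, 0.0, 0.0]), u=0.5)
```

Because the positions are evenly spaced, a particle with weight \(w^{(i)}\) is copied either \(\lfloor N w^{(i)} \rfloor\) or \(\lceil N w^{(i)} \rceil\) times, which is what gives the lower variance compared to multinomial resampling.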

Note

Particle Filters vs Kalman Filters:

  • Kalman: Exact for linear-Gaussian, \(O(n^3)\) per step (matrix ops)

  • Particle: Approximate, arbitrary nonlinear/non-Gaussian, \(O(N)\) per step

  • PF needs many particles (\(N \sim 100\) to \(10000\)) → computationally expensive

  • Use Kalman when applicable (faster, exact); PF when Kalman assumptions fail

8 Markov Decision Processes (MDPs)

8.1 MDP Framework

MDP models sequential decision-making with stochastic transitions.

Components:

  • \(\mathcal{S}\): state space

  • \(\mathcal{A}\): action space

  • \(P(s' \mid s, a)\): transition probability (dynamics)

  • \(R(s, a)\): reward function

  • \(\gamma \in [0,1)\): discount factor

Goal: Find policy \(\pi: \mathcal{S} \to \mathcal{A}\) (or \(\pi(a \mid s)\) for stochastic) maximizing expected cumulative reward: \[J(\pi) = \mathbb{E}(\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid \pi)\]

8.2 Value Functions

State value function: Expected return starting from state \(s\) following policy \(\pi\): \[V^\pi(s) = \mathbb{E}(\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s, \pi)\]

Action value function (Q-function): Expected return starting from \((s, a)\): \[Q^\pi(s, a) = \mathbb{E}(\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a, \pi)\]

Relationship: \(V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a)\) (for stochastic \(\pi\))

8.3 Bellman Equations

Bellman Expectation Equation: \[\begin{align*} V^\pi(s) & = \sum_a \pi(a \mid s) \left[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V^\pi(s')\right] \\ Q^\pi(s,a) & = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \sum_{a'} \pi(a' \mid s') Q^\pi(s', a') \end{align*}\]

Bellman Optimality Equation: \[\begin{align*} V^*(s) & = \max_a \left[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V^*(s')\right] \\ Q^*(s,a) & = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q^*(s', a') \end{align*}\]

Optimal policy: \(\pi^*(s) = \operatorname*{arg\,max}_a Q^*(s,a)\)

8.4 Dynamic Programming Methods

Assumption: Model is known (\(P, R\) available).

Policy Evaluation: Compute \(V^\pi\) for given \(\pi\).

Iterative update: \[V_{k+1}(s) = \sum_a \pi(a \mid s) \left[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V_k(s')\right]\]

Converges to \(V^\pi\) as \(k \to \infty\).

Value Iteration: Compute optimal value function \(V^*\).

\[V_{k+1}(s) = \max_a \left[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V_k(s')\right]\]

Converges to \(V^*\). Extract policy: \(\pi^*(s) = \operatorname*{arg\,max}_a \left[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V^*(s')\right]\).
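Value iteration is a few lines of NumPy. A sketch on a toy two-state MDP (the array layouts `P[s, a, s']` and `R[s, a]` are assumptions of this example):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P: (S, A, S') transition tensor; R: (S, A) rewards.
    Returns V* and the greedy policy."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V          # Bellman backup for every (s, a) at once
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# State 0: action 1 moves to absorbing state 1 with reward 1; action 0 stays (reward 0).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 1] = 1.0
R = np.array([[0.0, 1.0], [0.0, 0.0]])
V, policy = value_iteration(P, R)
```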

Policy Iteration:

  1. Policy Evaluation: Compute \(V^\pi\) (solve Bellman expectation)

  2. Policy Improvement: Update \(\pi(s) \leftarrow \operatorname*{arg\,max}_a Q^\pi(s,a)\)

  3. Repeat until convergence

Often converges faster than value iteration (fewer iterations, but each step more expensive).

Note

DP Complexity:

  • Value iteration: \(O(|\mathcal{S}|^2 |\mathcal{A}|)\) per iteration

  • Policy iteration: \(O(|\mathcal{S}|^3 + |\mathcal{S}|^2 |\mathcal{A}|)\) per iteration

  • Feasible only for small/medium state spaces (\(|\mathcal{S}| < 10^6\))

  • For large/continuous spaces → function approximation (RL)

9 Reinforcement Learning

9.1 The RL Problem

Key Difference from Supervised Learning:

  • No labeled data \((s,a,R)\) provided

  • Agent must explore environment to discover rewards

  • Exploration-exploitation trade-off: Try new actions vs exploit known good actions

  • Credit assignment: Which action led to reward?

Model-Free RL: Learn policy/value function without learning transition model \(P\).

9.2 Q-Learning (Off-Policy TD Control)

Idea: Learn \(Q^*(s,a)\) directly from experience without model.

Temporal Difference (TD) Update: \[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[R_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]\]

where:

  • \(\alpha\): learning rate

  • \(R_t + \gamma \max_{a'} Q(s_{t+1}, a')\): TD target (bootstrap estimate)

  • \(R_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\): TD error

Q-Learning Algorithm:

  1. Initialize \(Q(s,a)\) arbitrarily (e.g., zeros); initialize state \(s\)

  2. Repeat for each step:

    • Choose action \(a\) using \(\varepsilon\)-greedy policy from \(Q\)

    • Take action \(a\), observe \(r, s'\)

    • \(Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\)

    • \(s \leftarrow s'\)

\(\varepsilon\)-greedy Exploration: \[a = \begin{cases} \operatorname*{arg\,max}_{a'} Q(s, a'), & \text{with probability } 1-\varepsilon \\ \text{random action}, & \text{with probability } \varepsilon \end{cases}\]

Off-Policy: Q-learning learns optimal policy while following exploratory policy.

Convergence: Guaranteed if all \((s,a)\) pairs visited infinitely often and learning rate decays appropriately (\(\sum_t \alpha_t = \infty\), \(\sum_t \alpha_t^2 < \infty\)).
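A tabular sketch on a three-state chain. The environment interface `env_step(s, a) -> (r, s', done)` is an assumption for this example, not a standard API:

```python
import numpy as np

def q_learning(env_step, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
            r, s2, done = env_step(s, a)
            target = r + (0.0 if done else gamma * Q[s2].max())
            Q[s, a] += alpha * (target - Q[s, a])     # TD update
            s = s2
    return Q

# Chain 0 - 1 - 2: action 1 moves right, action 0 moves left; reward 1 on reaching
# the terminal state 2. Optimal policy: always move right.
def step(s, a):
    s2 = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == 2 else 0.0), s2, s2 == 2

Q = q_learning(step, n_states=3, n_actions=2)
```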

9.3 SARSA (On-Policy TD Control)

Similar to Q-learning but updates based on action actually taken (not max): \[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[R_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]\]

On-Policy: Learns value of policy being followed (e.g., \(\varepsilon\)-greedy).

Q-Learning vs SARSA:

  • Q-learning: More aggressive, learns optimal policy, can be unstable

  • SARSA: More conservative, learns safe policy under exploration, more stable

9.4 Deep Q-Networks (DQN)

For large state spaces (e.g., images), represent \(Q(s,a; \theta)\) with neural network.

Key Innovations:

  1. Experience Replay: Store transitions \((s, a, r, s')\) in replay buffer \(\mathcal{D}\). Sample random minibatch to break correlations.

  2. Target Network: Use separate network \(Q(s,a; \theta^-)\) for TD target (updated periodically). Stabilizes training.

Loss Function: \[L(\theta) = \mathbb{E}((r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta))^2)\]

DQN Algorithm:

  1. Initialize \(Q(s,a; \theta)\) and target \(Q(s,a; \theta^-)\); initialize replay buffer \(\mathcal{D}\)

  2. Repeat for each step:

    • Select action \(a\) using \(\varepsilon\)-greedy

    • Execute \(a\), observe \(r, s'\); store \((s, a, r, s')\) in \(\mathcal{D}\)

    • Sample random minibatch from \(\mathcal{D}\)

    • Compute TD target: \(y = r + \gamma \max_{a'} Q(s', a'; \theta^-)\)

    • Update \(\theta\) by minimizing \((y - Q(s, a; \theta))^2\)

    • Every \(C\) steps: \(\theta^- \leftarrow \theta\)

9.5 Policy Gradient Methods

Idea: Directly parameterize policy \(\pi(a \mid s; \theta)\) and optimize via gradient ascent.

Objective: \[J(\theta) = \mathbb{E}(\sum_{t=0}^\infty \gamma^t R_t \mid \pi_\theta)\]

Policy Gradient Theorem: \[\nabla_\theta J(\theta) = \mathbb{E}(\nabla_\theta \log \pi(a \mid s; \theta) Q^\pi(s,a))\]

REINFORCE Algorithm:

Monte Carlo estimate of gradient: \[\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta) G_t\] where \(G_t = \sum_{k=t}^T \gamma^{k-t} R_k\) (return from time \(t\)).

Update: \(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\)

Variance Reduction (Baseline):

Subtract baseline \(b(s)\) (typically \(V(s)\)): \[\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta) (G_t - b(s_t))\]

Advantage function: \(A(s,a) = Q(s,a) - V(s)\)
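A minimal REINFORCE-with-baseline sketch on a two-armed bandit. Episodes are one step, so \(G_t\) is just the immediate reward, and the softmax parameterization gives \(\nabla_\theta \log \pi(a) = \mathbf{1}_a - \pi\) (the bandit setup and running-mean baseline are assumptions of this example):

```python
import numpy as np

def reinforce_bandit(arm_means, episodes=2000, alpha=0.1, seed=0):
    """REINFORCE with a running-mean baseline on a one-step bandit."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(arm_means))
    baseline = 0.0
    for _ in range(episodes):
        p = np.exp(theta - theta.max())
        p /= p.sum()                              # softmax policy
        a = rng.choice(len(p), p=p)               # sample an action
        G = arm_means[a] + rng.normal(0.0, 0.1)   # noisy return
        grad_log = -p
        grad_log[a] += 1.0                        # grad of log pi(a | theta)
        theta += alpha * (G - baseline) * grad_log
        baseline += 0.1 * (G - baseline)          # variance-reduction baseline
    return theta

# Arm 1 pays mean reward 1, arm 0 pays 0: the policy should come to prefer arm 1.
theta = reinforce_bandit(np.array([0.0, 1.0]))
```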

9.6 Actor-Critic Methods

Combine value-based and policy-based methods:

  • Actor: Policy \(\pi(a \mid s; \theta)\) (chooses actions)

  • Critic: Value function \(V(s; w)\) or \(Q(s,a; w)\) (evaluates actions)

Advantage Actor-Critic (A2C):

Policy gradient: \(\nabla_\theta J \approx \nabla_\theta \log \pi(a_t \mid s_t; \theta) A_t\)

where \(A_t = r_t + \gamma V(s_{t+1}; w) - V(s_t; w)\) (TD error as advantage estimate).

Update actor: \(\theta \leftarrow \theta + \alpha_\theta \nabla_\theta \log \pi(a \mid s; \theta) A_t\)

Update critic: \(w \leftarrow w + \alpha_w A_t \nabla_w V(s; w)\)

Modern Algorithms:

  • A3C (Asynchronous Advantage Actor-Critic): Parallel actors

  • PPO (Proximal Policy Optimization): Clipped objective for stable updates

  • SAC (Soft Actor-Critic): Maximum entropy RL (exploration bonus)

  • TD3 (Twin Delayed DDPG): For continuous control

Note

Value-Based (Q-Learning) vs Policy-Based (PG):

  • Q-Learning: Better sample efficiency, works for discrete actions only, can be unstable

  • Policy Gradient: Works for continuous actions, more stable, higher variance

  • Actor-Critic: Best of both–stable, efficient, works for continuous actions

10 Interview Questions & Key Concepts

10.1 Common Interview Questions

  1. Q: What’s the difference between HMM and Kalman filter?

    A: HMM has discrete states, arbitrary emission distributions. Kalman filter has continuous states, linear-Gaussian dynamics. HMM uses forward-backward/Viterbi (sum/max over discrete states). Kalman uses matrix operations (closed-form Gaussian updates).

  2. Q: Explain the three HMM problems.

    A: (1) Evaluation: Compute \(p(x_{1:T})\) or \(p(z_t \mid x_{1:T})\) (forward-backward). (2) Decoding: Find best state sequence \(\operatorname*{arg\,max} p(z_{1:T} \mid x_{1:T})\) (Viterbi). (3) Learning: Estimate parameters from data (Baum-Welch/EM).

  3. Q: When would you use EKF vs UKF vs particle filter?

    A: EKF for mildly nonlinear systems (fast, needs Jacobians). UKF for moderately nonlinear (no derivatives, better than EKF). Particle filter for highly nonlinear or non-Gaussian (arbitrary distributions, computationally expensive).

  4. Q: What is the Kalman gain and what does it represent?

    A: \(K = \bar{\Sigma} C^\top (C \bar{\Sigma} C^\top + R)^{-1}\) balances prediction vs measurement. High \(K\) → trust measurement (low \(R\)). Low \(K\) → trust prediction (high \(R\) or low \(\bar{\Sigma}\)). It’s the optimal gain minimizing posterior covariance.

  5. Q: Explain the Bellman equation.

    A: Recursive relationship: Value of state = immediate reward + discounted value of next state. \(V(s) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s')\). Foundation of dynamic programming and RL. Optimality version uses max over actions.

  6. Q: Q-learning vs SARSA?

    A: Q-learning is off-policy (learns optimal \(Q^*\) while exploring). SARSA is on-policy (learns value of policy being followed). Q-learning more aggressive, SARSA more conservative. Q-learning update uses \(\max_a Q(s',a)\); SARSA uses \(Q(s', a')\) where \(a'\) is actually taken.

  7. Q: Why does DQN need experience replay?

    A: Consecutive samples are highly correlated (violates i.i.d. assumption). Replay buffer breaks correlations, improves data efficiency (reuse experiences), stabilizes training. Also helps with catastrophic forgetting.

  8. Q: What’s the exploration-exploitation trade-off?

    A: Exploitation: Choose known best action (maximize immediate reward). Exploration: Try new actions (might find better strategy). Too much exploitation → stuck in local optimum. Too much exploration → never converge. Balance with \(\varepsilon\)-greedy, UCB, or entropy regularization.

  9. Q: Policy gradient vs value-based methods?

    A: Value-based (Q-learning) learns \(Q(s,a)\), derives policy via \(\operatorname*{arg\,max}\)–works only for discrete actions. Policy gradient directly optimizes parameterized policy–works for continuous actions, more stable, but higher variance. Actor-critic combines both.

  10. Q: How do particle filters work?

    A: Represent belief by weighted samples (particles). (1) Predict: Propagate particles through dynamics. (2) Update: Weight by observation likelihood. (3) Resample when weights degenerate. Non-parametric, handles arbitrary distributions, but needs many particles.

10.2 Key Takeaways for Interviews

Note

Must-Know Concepts:

  • HMMs: Forward-backward (filtering/smoothing), Viterbi (decoding), Baum-Welch (learning)

  • Kalman Filter: Predict (propagate mean/cov), update (Kalman gain), linear-Gaussian assumption

  • EKF/UKF: Nonlinear extensions via linearization or sigma points

  • Particle Filters: Weighted samples, resampling, non-parametric

  • MDPs: States, actions, rewards, transitions, discount factor

  • Bellman Equations: Recursive decomposition of value functions

  • Q-Learning: Off-policy TD, \(\max\) operator, \(\varepsilon\)-greedy exploration

  • DQN: Experience replay, target network, neural Q-function

  • Policy Gradient: Direct policy optimization, REINFORCE, variance reduction

  • Actor-Critic: Combine policy (actor) and value (critic)

Note

Common Pitfalls:

  • Confusing filtering (\(p(z_t \mid x_{1:t})\)) with smoothing (\(p(z_t \mid x_{1:T})\))

  • Forgetting Kalman filter requires linear dynamics (use EKF/UKF otherwise)

  • Using Q-learning for continuous action spaces (use policy gradient or actor-critic)

  • Not understanding off-policy (Q-learning) vs on-policy (SARSA)

  • Ignoring exploration (greedy policies fail in RL)

  • Thinking DP solves large-scale problems (only feasible for small state spaces)

10.3 Algorithm Complexity Comparison

Algorithm | Time per Step | Space
HMM Forward-Backward | \(O(K^2 T)\) | \(O(KT)\)
Viterbi | \(O(K^2 T)\) | \(O(KT)\)
Kalman Filter | \(O(n^3)\) | \(O(n^2)\)
Particle Filter | \(O(N)\) | \(O(N)\)
Value Iteration | \(O(|\mathcal{S}|^2 |\mathcal{A}|)\) | \(O(|\mathcal{S}|)\)
Q-Learning (tabular) | \(O(1)\) | \(O(|\mathcal{S}| |\mathcal{A}|)\)
DQN | \(O(\text{forward pass})\) | \(O(|\mathcal{D}|)\) (replay buffer)

where \(K\) = number of states (HMM), \(n\) = state dimension (Kalman), \(N\) = number of particles.

11 References & Further Reading

  • Rabiner (1989): A Tutorial on Hidden Markov Models–classic HMM tutorial

  • Kalman (1960): A New Approach to Linear Filtering and Prediction Problems–original paper

  • Thrun et al. (2005): Probabilistic Robotics–Kalman, particle filters, SLAM

  • Sutton & Barto (2018): Reinforcement Learning: An Introduction–comprehensive RL textbook

  • Mnih et al. (2015): Human-level control through deep RL–DQN paper

  • Schulman et al. (2017): Proximal Policy Optimization–PPO paper

  • Murphy (2012): Machine Learning: A Probabilistic Perspective–Ch 17-18 (sequential models)