Chapter 2: Sequential & State-Space Models
Introduction: Modeling Sequential Data
Sequential data exhibits temporal or spatial dependencies: \(x_1, x_2, \ldots, x_T\) where order matters.
Key Challenges:
Variable length: Sequences can be arbitrarily long
Dependencies: Current state depends on history
Latent dynamics: Underlying state may be hidden (unobserved)
This Document Covers:
Hidden Markov Models (HMMs): Discrete latent states, discrete/continuous observations
Kalman Filters: Continuous state-space models with Gaussian distributions
Particle Filters: Non-parametric sequential Monte Carlo methods
Markov Decision Processes (MDPs): Sequential decision-making framework
Reinforcement Learning: Q-learning, policy gradients, value functions
6 Kalman Filters
6.1 Linear Gaussian State-Space Model
Kalman Filter is optimal Bayesian filter for linear-Gaussian systems.
State-Space Model: \[\begin{align} z_t & = A z_{t-1} + B u_t + w_t, \quad w_t \sim \mathcal{N}(0, Q) \quad \text{(state transition)} \\ x_t & = C z_t + v_t, \quad v_t \sim \mathcal{N}(0, R) \quad \text{(observation)} \end{align}\]
where:
\(z_t \in \mathbb{R}^n\): latent state (unobserved)
\(x_t \in \mathbb{R}^m\): observation (measured)
\(u_t \in \mathbb{R}^p\): control input (optional)
\(A\): state transition matrix
\(B\): control input matrix
\(C\): observation matrix
\(Q\): process noise covariance
\(R\): observation noise covariance
Key Insight: If prior is Gaussian and dynamics are linear-Gaussian, posterior remains Gaussian → closed-form updates!
6.2 Kalman Filter Recursion
Maintain Gaussian belief: \(p(z_t \mid x_{1:t}) = \mathcal{N}(z_t \mid \mu_t, \Sigma_t)\)
Predict Step (Time Update): \[\begin{align*} \bar{\mu}_t & = A \mu_{t-1} + B u_t \quad \text{(predicted state mean)} \\ \bar{\Sigma}_t & = A \Sigma_{t-1} A^\top + Q \quad \text{(predicted covariance)} \end{align*}\]
Update Step (Measurement Update): \[\begin{align*} K_t & = \bar{\Sigma}_t C^\top (C \bar{\Sigma}_t C^\top + R)^{-1} \quad (\text{Kalman gain} = \Sigma_{ZX}\Sigma_{XX}^{-1}) \\ \mu_t & = \bar{\mu}_t + K_t (x_t - C \bar{\mu}_t) \quad \text{(corrected mean)} \\ \Sigma_t & = (I - K_t C) \bar{\Sigma}_t \quad (\text{corrected covariance} = \Sigma_{ZZ} - \Sigma_{ZX}\Sigma_{XX}^{-1}\Sigma_{XZ}) \end{align*}\]
Innovation: \(y_t = x_t - C \bar{\mu}_t\) (prediction error)
Kalman Gain Interpretation:
\(K_t \approx 0\): High observation noise (\(R\) large) → trust prediction
\(K_t \approx C^{-1}\) (when \(C\) is invertible): Low observation noise (\(R\) small) → trust measurement
Joseph Form:
Numerical stability: Floating point errors can cause \(\Sigma_t\) to lose positive definiteness.
Joseph form of covariance update maintains symmetry and positive definiteness: \[\Sigma_t = (I - K_t C) \bar{\Sigma}_t (I - K_t C)^\top + K_t R K_t^\top\]
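The predict/update recursion with the Joseph-form covariance update can be sketched in a few lines of NumPy. This is a minimal illustration, not a production filter; the 1-D model and noise values are invented for demonstration:

```python
import numpy as np

def kalman_step(mu, Sigma, x, A, C, Q, R, B=None, u=None):
    """One predict + update cycle; Joseph form keeps Sigma symmetric PSD."""
    # Predict (time update)
    mu_bar = A @ mu + (B @ u if B is not None else 0)
    Sigma_bar = A @ Sigma @ A.T + Q
    # Update (measurement update)
    S = C @ Sigma_bar @ C.T + R              # innovation covariance
    K = Sigma_bar @ C.T @ np.linalg.inv(S)   # Kalman gain
    y = x - C @ mu_bar                       # innovation (prediction error)
    mu_new = mu_bar + K @ y
    I_KC = np.eye(len(mu)) - K @ C
    Sigma_new = I_KC @ Sigma_bar @ I_KC.T + K @ R @ K.T  # Joseph form
    return mu_new, Sigma_new

# Illustrative 1-D random walk observed with noise
A = np.array([[1.0]]); C = np.array([[1.0]])
Q = np.array([[0.01]]); R = np.array([[1.0]])
mu, Sigma = np.array([0.0]), np.array([[1.0]])
mu, Sigma = kalman_step(mu, Sigma, np.array([0.5]), A, C, Q, R)
```

After one step the posterior variance is smaller than the prior and the mean has moved partway toward the measurement, as the gain interpretation above predicts.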
6.3 Steady-State Kalman Filter and the Riccati Equation
For time-invariant systems (constant \(A, C, Q, R\)), the covariance \(\Sigma_t\) converges to a steady-state value \(\Sigma_\infty\) as \(t \to \infty\).
Discrete Algebraic Riccati Equation (DARE): \[\Sigma_\infty = A \Sigma_\infty A^\top + Q - A \Sigma_\infty C^\top (C \Sigma_\infty C^\top + R)^{-1} C \Sigma_\infty A^\top\]
This is a fixed-point equation: \(\Sigma_\infty = f(\Sigma_\infty)\) where \(f\) is the covariance update.
Practical Implications:
Offline computation: Solve DARE once offline (e.g., via iterative methods or control toolboxes)
Runtime efficiency: Use constant \(\Sigma_\infty\) and \(K_\infty = \Sigma_\infty C^\top (C \Sigma_\infty C^\top + R)^{-1}\)
No covariance updates: Only update state mean \(\mu_t\) at runtime (saves matrix operations)
Common in practice: Most deployed systems use steady-state KF for computational efficiency
When Steady-State KF Applies:
Time-invariant dynamics (\(A, C, Q, R\) constant)
System is observable and controllable (DARE solution exists and is unique)
Long-running applications where transient behavior is negligible (tracking, navigation)
Connection to LQG Control: The DARE also appears in Linear-Quadratic-Gaussian (LQG) optimal control: the Kalman filter is the optimal state estimator, and the DARE solution gives the steady-state estimation error covariance.
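One simple way to obtain \(\Sigma_\infty\) offline is to iterate the covariance recursion until it stops changing. A sketch follows; the scalar system is illustrative, and production code would typically call a dedicated DARE solver from a control toolbox instead:

```python
import numpy as np

def solve_dare(A, C, Q, R, iters=500, tol=1e-10):
    """Fixed-point iteration on the predicted-covariance recursion."""
    Sigma = Q.copy()
    for _ in range(iters):
        S = C @ Sigma @ C.T + R
        Sigma_next = (A @ Sigma @ A.T + Q
                      - A @ Sigma @ C.T @ np.linalg.inv(S) @ C @ Sigma @ A.T)
        if np.max(np.abs(Sigma_next - Sigma)) < tol:
            return Sigma_next
        Sigma = Sigma_next
    return Sigma

# Illustrative scalar system
A = np.array([[1.0]]); C = np.array([[1.0]])
Q = np.array([[0.01]]); R = np.array([[1.0]])
Sigma_inf = solve_dare(A, C, Q, R)
K_inf = Sigma_inf @ C.T @ np.linalg.inv(C @ Sigma_inf @ C.T + R)
```

At runtime one would then use the constant \(K_\infty\) and update only the mean, as described above.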
6.4 Extended Kalman Filter (EKF)
For nonlinear dynamics: \[\begin{align*} z_t & = f(z_{t-1}, u_t) + w_t \\ x_t & = h(z_t) + v_t \end{align*}\]
Linearize around current estimate via first-order Taylor expansion: \[A_t = \frac{\partial f}{\partial z}\bigg|_{z=\mu_{t-1}}, \quad C_t = \frac{\partial h}{\partial z}\bigg|_{z=\bar{\mu}_t}\]
Then apply standard Kalman filter with time-varying \(A_t, C_t\).
Limitations: EKF can fail for highly nonlinear systems (linearization error accumulates).
6.5 Unscented Kalman Filter (UKF)
Idea: Approximate Gaussian distribution by sigma points–deterministically chosen samples that capture mean and covariance.
Key Insight: In the Kalman filter, covariance evolution is independent of actual observations: \[\bar{\Sigma}_{t+1} = A \Sigma_t A^\top + Q\] This depends only on the previous covariance \(\Sigma_t\), not on \(x_t\)! To propagate uncertainty through nonlinear \(f\), we only need to track how the distribution of errors transforms.
UKF’s Solution: Sample the input distribution, propagate samples through \(f\), estimate output statistics.
Generate sigma points \(\mathcal{X}^{(i)}\) around mean \(\mu\) with covariance \(\Sigma\):
\(\mathcal{X}^{(0)} = \mu\) (center point)
\(\mathcal{X}^{(i)} = \mu + \sqrt{(n+\lambda)\Sigma}_i\) for \(i=1,\ldots,n\) (spread along principal axes)
\(\mathcal{X}^{(i)} = \mu - \sqrt{(n+\lambda)\Sigma}_{i-n}\) for \(i=n+1,\ldots,2n\)
where \(\sqrt{\Sigma}\) is Cholesky decomposition, \(\lambda\) is tuning parameter.
Propagate each sigma point: \(\mathcal{Y}^{(i)} = f(\mathcal{X}^{(i)})\) (no linearization!)
Estimate output statistics from transformed samples: \[\begin{align*} \bar{\mu} & = \sum_{i=0}^{2n} w_i^{(m)} \mathcal{Y}^{(i)} \\ \bar{\Sigma} & = \sum_{i=0}^{2n} w_i^{(c)} (\mathcal{Y}^{(i)} - \bar{\mu})(\mathcal{Y}^{(i)} - \bar{\mu})^\top \end{align*}\] where \(w_i\) are weights (chosen to match mean and covariance exactly).
Apply same process to observation model \(h\) for measurement update
Why This Works:
Sigma points capture 1st and 2nd moments (mean, covariance) exactly for Gaussian
Propagating 2n+1 points captures curvature of \(f\) better than single linearization point (EKF)
Accurate up to 3rd-order moments for polynomial nonlinearities
Extension: Could use more sophisticated sampling (e.g., cubature rules, higher-order unscented transforms) if needed
Advantages over EKF:
No Jacobian computation (derivative-free)
Better approximation for highly nonlinear functions
More robust to nonlinearity
Numerical stability (Cholesky decomposition maintains positive-definiteness)
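The unscented transform at the heart of the UKF can be sketched as follows. For simplicity this uses equal mean/covariance weights determined by \(\lambda\) alone; the full UKF parameterization with separate \(w^{(m)}, w^{(c)}\) (and \(\alpha, \beta\) tuning) refines the same idea:

```python
import numpy as np

def unscented_transform(mu, Sigma, f, lam=2.0):
    """Propagate a Gaussian N(mu, Sigma) through nonlinear f via 2n+1 sigma points."""
    n = len(mu)
    L = np.linalg.cholesky((n + lam) * Sigma)   # matrix square root (columns)
    points = [mu] + [mu + L[:, i] for i in range(n)] \
                  + [mu - L[:, i] for i in range(n)]
    w = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    w[0] = lam / (n + lam)                      # center-point weight
    Y = np.array([f(p) for p in points])        # propagate (no linearization)
    mu_out = w @ Y
    diff = Y - mu_out
    Sigma_out = (w[:, None] * diff).T @ diff
    return mu_out, Sigma_out

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
mu_id, Sigma_id = unscented_transform(mu, Sigma, lambda z: z)
```

For a linear (here identity) \(f\), the transform recovers the input mean and covariance exactly, which is the "captures 1st and 2nd moments exactly" property claimed above.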
6.6 Kalman Smoother (RTS Smoother)
Compute smoothed estimate \(p(z_t \mid x_{1:T})\) (uses all data, not just past).
Forward pass: Standard Kalman filter → \(\mu_t, \Sigma_t\)
Backward pass: Rauch-Tung-Striebel (RTS) recursion: \[\begin{align*} G_t & = \Sigma_t A^\top \bar{\Sigma}_{t+1}^{-1} \\ \mu_t^s & = \mu_t + G_t (\mu_{t+1}^s - \bar{\mu}_{t+1}) \\ \Sigma_t^s & = \Sigma_t + G_t (\Sigma_{t+1}^s - \bar{\Sigma}_{t+1}) G_t^\top \end{align*}\]
Smoother has lower uncertainty than filter: \(\Sigma_t^s \preceq \Sigma_t\) (in PSD sense).
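A scalar end-to-end sketch (forward filter plus RTS backward pass; the model constants are illustrative) shows the smoother's variances dropping below the filter's:

```python
import numpy as np

# Scalar model (illustrative): z_t = a z_{t-1} + w_t,  x_t = z_t + v_t
rng = np.random.default_rng(0)
a, q, r, T = 0.9, 0.1, 0.5, 50
z = np.zeros(T)
for t in range(1, T):
    z[t] = a * z[t - 1] + rng.normal(0, np.sqrt(q))
x = z + rng.normal(0, np.sqrt(r), T)

# Forward pass: Kalman filter, storing predicted and filtered moments
mu = np.zeros(T); P = np.zeros(T)           # filtered mean/variance
mu_bar = np.zeros(T); P_bar = np.zeros(T)   # predicted mean/variance
m, p = 0.0, 1.0                             # prior
for t in range(T):
    mb, pb = a * m, a * a * p + q           # predict
    k = pb / (pb + r)                       # Kalman gain
    m, p = mb + k * (x[t] - mb), (1 - k) * pb
    mu[t], P[t], mu_bar[t], P_bar[t] = m, p, mb, pb

# Backward pass: RTS recursion
mu_s, P_s = mu.copy(), P.copy()
for t in range(T - 2, -1, -1):
    G = P[t] * a / P_bar[t + 1]
    mu_s[t] = mu[t] + G * (mu_s[t + 1] - mu_bar[t + 1])
    P_s[t] = P[t] + G * G * (P_s[t + 1] - P_bar[t + 1])
```

At the final time step filter and smoother coincide; at every earlier step the smoothed variance is strictly smaller, since it also conditions on future observations.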
When to Use Kalman Filters:
Tracking: Object tracking in video, GPS/IMU sensor fusion
Control: Robotics, autonomous vehicles (state estimation for controller)
Signal processing: Noise reduction, prediction in time series
Economics: State-space models for GDP, inflation, etc.
Assumptions Required:
Linear dynamics (or use EKF/UKF for nonlinear)
Gaussian noise
Known noise covariances \(Q, R\)
Classic Example: Tracking a Moving Object
State: \(z_t = [x, y, \dot{x}, \dot{y}]^\top\) (position + velocity)
Observation: \(x_t = [x_{\text{meas}}, y_{\text{meas}}]^\top\) (noisy GPS measurements)
Constant velocity model: \[A = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}\]
Kalman filter fuses noisy measurements with motion model to estimate true position/velocity.
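The constant-velocity matrices are easy to sanity-check in code: one application of \(A\) advances each position by velocity times \(\Delta t\), and \(C\) extracts the position components (the numbers below are illustrative):

```python
import numpy as np

dt = 0.1
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)   # constant-velocity dynamics
C = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)    # observe position only

z = np.array([0.0, 0.0, 1.0, 2.0])  # at origin, velocity (1, 2)
z_next = A @ z                       # one step of the motion model
# z_next = [0.1, 0.2, 1.0, 2.0]: position advanced by v*dt, velocity unchanged
```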
7 Particle Filters (Sequential Monte Carlo)
7.1 Motivation: Beyond Gaussians
Limitation of Kalman Filters: Require Gaussian distributions.
Particle Filter (PF) represents belief \(p(z_t \mid x_{1:t})\) by weighted samples (particles): \[p(z_t \mid x_{1:t}) \approx \sum_{i=1}^N w_t^{(i)} \delta(z_t - z_t^{(i)})\]
where \(\{z_t^{(i)}, w_t^{(i)}\}_{i=1}^N\) are particles and weights.
Advantages:
Handles arbitrary distributions (multimodal, non-Gaussian)
Nonlinear dynamics without linearization
Asymptotically converges to true posterior as \(N \to \infty\)
7.2 Sequential Importance Sampling (SIS)
Importance Sampling: To estimate \(\mathbb{E}(f(z))\) under \(p(z)\), sample from proposal \(q(z)\): \[\mathbb{E}(f(z)) = \int f(z) p(z) dz = \int f(z) \frac{p(z)}{q(z)} q(z) dz \approx \frac{1}{N} \sum_{i=1}^N w^{(i)} f(z^{(i)})\] where \(z^{(i)} \sim q(z)\) and \(w^{(i)} = \frac{p(z^{(i)})}{q(z^{(i)})}\) (importance weight).
Sequential Update:
At time \(t\), propose new particle: \(z_t^{(i)} \sim q(z_t \mid z_{t-1}^{(i)}, x_t)\)
Update weight: \[w_t^{(i)} \propto w_{t-1}^{(i)} \frac{p(x_t \mid z_t^{(i)}) p(z_t^{(i)} \mid z_{t-1}^{(i)})}{q(z_t^{(i)} \mid z_{t-1}^{(i)}, x_t)}\]
Normalize: \(w_t^{(i)} \leftarrow \frac{w_t^{(i)}}{\sum_j w_t^{(j)}}\)
7.3 Bootstrap Particle Filter
Simplest choice: Use transition prior as proposal: \(q(z_t \mid z_{t-1}, x_t) = p(z_t \mid z_{t-1})\)
Weight update simplifies to: \[w_t^{(i)} \propto w_{t-1}^{(i)} p(x_t \mid z_t^{(i)})\]
Algorithm:
Sample \(z_t^{(i)} \sim p(z_t \mid z_{t-1}^{(i)})\) (predict)
Compute \(w_t^{(i)} \propto w_{t-1}^{(i)} p(x_t \mid z_t^{(i)})\) (weight by likelihood)
Normalize weights
Resample if needed (see below)
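The steps above can be sketched for a toy 1-D random-walk model (all constants, including the observation sequence, are invented for illustration):

```python
import numpy as np

# Toy model: z_t = z_{t-1} + N(0, 0.1),  x_t = z_t + N(0, 0.5)
rng = np.random.default_rng(1)
N = 2000

def likelihood(x, z, r=0.5):
    """Gaussian observation likelihood p(x | z), up to a constant."""
    return np.exp(-0.5 * (x - z) ** 2 / r)

particles = rng.normal(0.0, 1.0, N)     # samples from the prior
weights = np.full(N, 1.0 / N)
for x in [0.2, 0.4, 0.5]:               # illustrative observations
    particles = particles + rng.normal(0, np.sqrt(0.1), N)  # predict
    weights = weights * likelihood(x, particles)            # weight by likelihood
    weights = weights / weights.sum()                       # normalize

mean_est = np.sum(weights * particles)  # posterior mean estimate
```

With this linear-Gaussian toy model the estimate can be checked against the exact Kalman answer; in practice the particle filter's value is that the same code works when the model is nonlinear or non-Gaussian.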
7.4 Degeneracy Problem & Resampling
Problem: After a few iterations most weights become negligible, so only a few particles contribute (degeneracy).
Solution: Resampling–duplicate high-weight particles, discard low-weight ones.
Effective Sample Size (ESS): \[N_{\text{eff}} = \frac{1}{\sum_{i=1}^N (w_t^{(i)})^2}\]
If \(N_{\text{eff}} < N_{\text{thresh}}\) (e.g., \(N/2\)), resample.
Resampling Methods:
Multinomial: Draw \(N\) samples with replacement according to weights
Systematic: Lower variance, deterministic spacing
Residual: Hybrid approach
After resampling, reset weights: \(w_t^{(i)} = 1/N\).
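ESS and systematic resampling are both short to implement. A sketch follows; the 4-particle example is contrived so that one dominant weight gets duplicated:

```python
import numpy as np

def ess(weights):
    """Effective sample size: N_eff = 1 / sum(w_i^2)."""
    return 1.0 / np.sum(weights ** 2)

def systematic_resample(particles, weights, rng):
    """Low-variance resampling: one uniform draw, N evenly spaced pointers."""
    N = len(weights)
    positions = (rng.random() + np.arange(N)) / N
    idx = np.searchsorted(np.cumsum(weights), positions)
    return particles[idx], np.full(N, 1.0 / N)   # reset weights to uniform

rng = np.random.default_rng(0)
particles = np.array([0.0, 1.0, 2.0, 3.0])
weights = np.array([0.7, 0.1, 0.1, 0.1])         # one particle dominates
p2, w2 = systematic_resample(particles, weights, rng)
```

Here \(N_{\text{eff}} = 1/0.52 \approx 1.92 \ll N = 4\), so resampling would be triggered; afterwards the dominant particle appears multiple times and all weights are \(1/N\).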
Particle Filters vs Kalman Filters:
Kalman: Exact for linear-Gaussian, \(O(n^3)\) per step (matrix ops)
Particle: Approximate, arbitrary nonlinear/non-Gaussian, \(O(N)\) per step
PF needs many particles (typically \(N \sim 10^2\)–\(10^4\)) → computationally expensive
Use Kalman when applicable (faster, exact); PF when Kalman assumptions fail
8 Markov Decision Processes (MDPs)
8.1 MDP Framework
MDP models sequential decision-making with stochastic transitions.
Components:
\(\mathcal{S}\): state space
\(\mathcal{A}\): action space
\(P(s' \mid s, a)\): transition probability (dynamics)
\(R(s, a)\): reward function
\(\gamma \in [0,1)\): discount factor
Goal: Find policy \(\pi: \mathcal{S} \to \mathcal{A}\) (or \(\pi(a \mid s)\) for stochastic) maximizing expected cumulative reward: \[J(\pi) = \mathbb{E}(\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid \pi)\]
8.2 Value Functions
State value function: Expected return starting from state \(s\) following policy \(\pi\): \[V^\pi(s) = \mathbb{E}(\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s, \pi)\]
Action value function (Q-function): Expected return starting from \((s, a)\): \[Q^\pi(s, a) = \mathbb{E}(\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a, \pi)\]
Relationship: \(V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a)\) (for stochastic \(\pi\))
8.3 Bellman Equations
Bellman Expectation Equation: \[\begin{align*} V^\pi(s) & = \sum_a \pi(a \mid s) \left[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V^\pi(s')\right] \\ Q^\pi(s,a) & = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \sum_{a'} \pi(a' \mid s') Q^\pi(s', a') \end{align*}\]
Bellman Optimality Equation: \[\begin{align*} V^*(s) & = \max_a \left[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V^*(s')\right] \\ Q^*(s,a) & = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q^*(s', a') \end{align*}\]
Optimal policy: \(\pi^*(s) = \operatorname*{arg\,max}_a Q^*(s,a)\)
8.4 Dynamic Programming Methods
Assumption: Model is known (\(P, R\) available).
Policy Evaluation: Compute \(V^\pi\) for given \(\pi\).
Iterative update: \[V_{k+1}(s) = \sum_a \pi(a \mid s) \left[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V_k(s')\right]\]
Converges to \(V^\pi\) as \(k \to \infty\).
Value Iteration: Compute optimal value function \(V^*\).
\[V_{k+1}(s) = \max_a \left[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V_k(s')\right]\]
Converges to \(V^*\). Extract policy via one-step lookahead: \(\pi(s) = \operatorname*{arg\,max}_a \left[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V^*(s')\right]\).
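Value iteration is a few lines on a tabular MDP. The 2-state, 2-action MDP below is invented for illustration (state 1 pays reward, and action 1 from state 0 is the better way to get there):

```python
import numpy as np

# P[a, s, s'] = transition probability; R[s, a] = reward
P = np.array([[[0.9, 0.1], [0.1, 0.9]],    # action 0: mostly stay put
              [[0.5, 0.5], [0.5, 0.5]]])   # action 1: 50/50 jump
R = np.array([[0.0, 0.0],                  # state 0 pays nothing
              [1.0, 0.5]])                 # state 1 pays, best with action 0
gamma = 0.9

V = np.zeros(2)
for _ in range(500):                        # Bellman optimality backups
    Q = R + gamma * np.einsum('ast,t->sa', P, V)  # Q[s,a]
    V = Q.max(axis=1)
policy = Q.argmax(axis=1)
```

The greedy policy extracted at the end takes the risky jump (action 1) from the zero-reward state and stays (action 0) in the rewarding state.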
Policy Iteration:
Policy Evaluation: Compute \(V^\pi\) (solve Bellman expectation)
Policy Improvement: Update \(\pi(s) \leftarrow \operatorname*{arg\,max}_a Q^\pi(s,a)\)
Repeat until convergence
Often converges faster than value iteration (fewer iterations, but each step more expensive).
DP Complexity:
Value iteration: \(O(|\mathcal{S}|^2 |\mathcal{A}|)\) per iteration
Policy iteration: \(O(|\mathcal{S}|^3 + |\mathcal{S}|^2 |\mathcal{A}|)\) per iteration
Feasible only for small/medium state spaces (\(|\mathcal{S}| < 10^6\))
For large/continuous spaces → function approximation (RL)
9 Reinforcement Learning
9.1 The RL Problem
Key Difference from Supervised Learning:
No labeled data \((s,a,R)\) provided
Agent must explore environment to discover rewards
Exploration-exploitation trade-off: Try new actions vs exploit known good actions
Credit assignment: Which action led to reward?
Model-Free RL: Learn policy/value function without learning transition model \(P\).
9.2 Q-Learning (Off-Policy TD Control)
Idea: Learn \(Q^*(s,a)\) directly from experience without model.
Temporal Difference (TD) Update: \[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[R_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]\]
where:
\(\alpha\): learning rate
\(R_t + \gamma \max_{a'} Q(s_{t+1}, a')\): TD target (bootstrap estimate)
\(R_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\): TD error
Q-Learning Algorithm:
Initialize \(Q(s,a)\) arbitrarily (e.g., zeros); initialize state \(s\)
Repeat for each step:
Choose action \(a\) using \(\varepsilon\)-greedy policy from \(Q\)
Take action \(a\), observe \(r, s'\)
\(Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\)
\(s \leftarrow s'\)
\(\varepsilon\)-greedy Exploration: \[a = \begin{cases} \operatorname*{arg\,max}_{a'} Q(s, a'), & \text{with probability } 1-\varepsilon \\ \text{random action}, & \text{with probability } \varepsilon \end{cases}\]
Off-Policy: Q-learning learns optimal policy while following exploratory policy.
Convergence: Guaranteed if all \((s,a)\) pairs visited infinitely often and learning rate decays appropriately (\(\sum_t \alpha_t = \infty\), \(\sum_t \alpha_t^2 < \infty\)).
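Tabular Q-learning on a toy 4-state chain (invented for illustration: action 1 moves right, action 0 moves left, reward 1 on reaching the terminal state 3) recovers the "go right" policy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2               # chain 0-1-2-3; 0=left, 1=right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.3

def step(s, a):
    """Deterministic chain dynamics; state 3 is terminal with reward 1."""
    s2 = max(0, s - 1) if a == 0 else min(3, s + 1)
    r = 1.0 if s2 == 3 else 0.0
    return s2, r, s2 == 3

for _ in range(2000):                    # episodes
    s = 0
    for _ in range(50):                  # step limit per episode
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s2, r, done = step(s, a)
        target = r + (0.0 if done else gamma * Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])  # TD update
        s = s2
        if done:
            break
greedy = Q.argmax(axis=1)
```

The learned values match the optimal ones: \(Q^*(2,\text{right}) = 1\), \(Q^*(1,\text{right}) = \gamma = 0.9\), \(Q^*(0,\text{right}) = \gamma^2 = 0.81\).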
9.3 SARSA (On-Policy TD Control)
Similar to Q-learning but updates based on action actually taken (not max): \[Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[R_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]\]
On-Policy: Learns value of policy being followed (e.g., \(\varepsilon\)-greedy).
Q-Learning vs SARSA:
Q-learning: More aggressive, learns optimal policy, can be unstable
SARSA: More conservative, learns safe policy under exploration, more stable
9.4 Deep Q-Networks (DQN)
For large state spaces (e.g., images), represent \(Q(s,a; \theta)\) with neural network.
Key Innovations:
Experience Replay: Store transitions \((s, a, r, s')\) in replay buffer \(\mathcal{D}\). Sample random minibatch to break correlations.
Target Network: Use separate network \(Q(s,a; \theta^-)\) for TD target (updated periodically). Stabilizes training.
Loss Function: \[L(\theta) = \mathbb{E}((r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta))^2)\]
DQN Algorithm:
Initialize \(Q(s,a; \theta)\) and target network \(Q(s,a; \theta^-)\); initialize replay buffer \(\mathcal{D}\)
Repeat for each step:
Select action \(a\) using \(\varepsilon\)-greedy
Execute \(a\), observe \(r, s'\); store \((s, a, r, s')\) in \(\mathcal{D}\)
Sample a random minibatch from \(\mathcal{D}\)
Compute TD target: \(y = r + \gamma \max_{a'} Q(s', a'; \theta^-)\)
Update \(\theta\) by minimizing \((y - Q(s, a; \theta))^2\)
Every \(C\) steps: \(\theta^- \leftarrow \theta\)
9.5 Policy Gradient Methods
Idea: Directly parameterize policy \(\pi(a \mid s; \theta)\) and optimize via gradient ascent.
Objective: \[J(\theta) = \mathbb{E}(\sum_{t=0}^\infty \gamma^t R_t \mid \pi_\theta)\]
Policy Gradient Theorem: \[\nabla_\theta J(\theta) = \mathbb{E}(\nabla_\theta \log \pi(a \mid s; \theta) Q^\pi(s,a))\]
REINFORCE Algorithm:
Monte Carlo estimate of gradient: \[\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta) G_t\] where \(G_t = \sum_{k=t}^T \gamma^{k-t} R_k\) (return from time \(t\)).
Update: \(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\)
Variance Reduction (Baseline):
Subtract baseline \(b(s)\) (typically \(V(s)\)): \[\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta) (G_t - b(s_t))\]
Advantage function: \(A(s,a) = Q(s,a) - V(s)\)
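The return \(G_t\) used in these gradient estimates is computed with a single backward pass over the episode (a sketch; the reward sequence is illustrative):

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = sum_{k>=t} gamma^{k-t} r_k, via the recursion G_t = r_t + gamma*G_{t+1}."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

G = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
# G[2] = 2, G[1] = 0 + 0.5*2 = 1, G[0] = 1 + 0.5*1 = 1.5
```

Subtracting a baseline \(b(s_t)\) from each \(G_t\) then gives the variance-reduced estimator above without changing its expectation.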
9.6 Actor-Critic Methods
Combine value-based and policy-based methods:
Actor: Policy \(\pi(a \mid s; \theta)\) (chooses actions)
Critic: Value function \(V(s; w)\) or \(Q(s,a; w)\) (evaluates actions)
Advantage Actor-Critic (A2C):
Policy gradient: \(\nabla_\theta J \approx \nabla_\theta \log \pi(a_t \mid s_t; \theta) A_t\)
where \(A_t = r_t + \gamma V(s_{t+1}; w) - V(s_t; w)\) (TD error as advantage estimate).
Update actor: \(\theta \leftarrow \theta + \alpha_\theta \nabla_\theta \log \pi(a \mid s; \theta) A_t\)
Update critic: \(w \leftarrow w + \alpha_w A_t \nabla_w V(s; w)\)
Modern Algorithms:
A3C (Asynchronous Advantage Actor-Critic): Parallel actors
PPO (Proximal Policy Optimization): Clipped objective for stable updates
SAC (Soft Actor-Critic): Maximum entropy RL (exploration bonus)
TD3 (Twin Delayed DDPG): For continuous control
Value-Based (Q-Learning) vs Policy-Based (PG):
Q-Learning: Better sample efficiency, works for discrete actions only, can be unstable
Policy Gradient: Works for continuous actions, more stable, higher variance
Actor-Critic: Best of both–stable, efficient, works for continuous actions
10 Interview Questions & Key Concepts
10.1 Common Interview Questions
Q: What’s the difference between HMM and Kalman filter?
A: HMM has discrete states, arbitrary emission distributions. Kalman filter has continuous states, linear-Gaussian dynamics. HMM uses forward-backward/Viterbi (sum/max over discrete states). Kalman uses matrix operations (closed-form Gaussian updates).
Q: Explain the three HMM problems.
A: (1) Evaluation: Compute the likelihood \(p(x_{1:T})\) (forward algorithm); the forward-backward algorithm additionally gives smoothed posteriors \(p(z_t \mid x_{1:T})\). (2) Decoding: Find best state sequence \(\operatorname*{arg\,max} p(z_{1:T} \mid x_{1:T})\) (Viterbi). (3) Learning: Estimate parameters from data (Baum-Welch/EM).
Q: When would you use EKF vs UKF vs particle filter?
A: EKF for mildly nonlinear systems (fast, needs Jacobians). UKF for moderately nonlinear (no derivatives, better than EKF). Particle filter for highly nonlinear or non-Gaussian (arbitrary distributions, computationally expensive).
Q: What is the Kalman gain and what does it represent?
A: \(K = \bar{\Sigma} C^\top (C \bar{\Sigma} C^\top + R)^{-1}\) balances prediction vs measurement. High \(K\) → trust measurement (low \(R\)). Low \(K\) → trust prediction (high \(R\) or low \(\bar{\Sigma}\)). It’s the optimal gain minimizing posterior covariance.
Q: Explain the Bellman equation.
A: Recursive relationship: Value of state = immediate reward + discounted value of next state. \(V(s) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s')\). Foundation of dynamic programming and RL. Optimality version uses max over actions.
Q: Q-learning vs SARSA?
A: Q-learning is off-policy (learns optimal \(Q^*\) while exploring). SARSA is on-policy (learns value of policy being followed). Q-learning more aggressive, SARSA more conservative. Q-learning update uses \(\max_a Q(s',a)\); SARSA uses \(Q(s', a')\) where \(a'\) is actually taken.
Q: Why does DQN need experience replay?
A: Consecutive samples are highly correlated (violates i.i.d. assumption). Replay buffer breaks correlations, improves data efficiency (reuse experiences), stabilizes training. Also helps with catastrophic forgetting.
Q: What’s the exploration-exploitation trade-off?
A: Exploitation: Choose known best action (maximize immediate reward). Exploration: Try new actions (might find better strategy). Too much exploitation → stuck in local optimum. Too much exploration → never converge. Balance with \(\varepsilon\)-greedy, UCB, or entropy regularization.
Q: Policy gradient vs value-based methods?
A: Value-based (Q-learning) learns \(Q(s,a)\), derives policy via \(\operatorname*{arg\,max}\)–works only for discrete actions. Policy gradient directly optimizes parameterized policy–works for continuous actions, more stable, but higher variance. Actor-critic combines both.
Q: How do particle filters work?
A: Represent belief by weighted samples (particles). (1) Predict: Propagate particles through dynamics. (2) Update: Weight by observation likelihood. (3) Resample when weights degenerate. Non-parametric, handles arbitrary distributions, but needs many particles.
10.2 Key Takeaways for Interviews
Must-Know Concepts:
HMMs: Forward-backward (filtering/smoothing), Viterbi (decoding), Baum-Welch (learning)
Kalman Filter: Predict (propagate mean/cov), update (Kalman gain), linear-Gaussian assumption
EKF/UKF: Nonlinear extensions via linearization or sigma points
Particle Filters: Weighted samples, resampling, non-parametric
MDPs: States, actions, rewards, transitions, discount factor
Bellman Equations: Recursive decomposition of value functions
Q-Learning: Off-policy TD, \(\max\) operator, \(\varepsilon\)-greedy exploration
DQN: Experience replay, target network, neural Q-function
Policy Gradient: Direct policy optimization, REINFORCE, variance reduction
Actor-Critic: Combine policy (actor) and value (critic)
Common Pitfalls:
Confusing filtering (\(p(z_t \mid x_{1:t})\)) with smoothing (\(p(z_t \mid x_{1:T})\))
Forgetting Kalman filter requires linear dynamics (use EKF/UKF otherwise)
Using Q-learning for continuous action spaces (use policy gradient or actor-critic)
Not understanding off-policy (Q-learning) vs on-policy (SARSA)
Ignoring exploration (greedy policies fail in RL)
Thinking DP solves large-scale problems (only feasible for small state spaces)
10.3 Algorithm Complexity Comparison
| Algorithm | Time per Step | Space |
|---|---|---|
| HMM Forward-Backward | \(O(K^2 T)\) | \(O(KT)\) |
| Viterbi | \(O(K^2 T)\) | \(O(KT)\) |
| Kalman Filter | \(O(n^3)\) | \(O(n^2)\) |
| Particle Filter | \(O(N)\) | \(O(N)\) |
| Value Iteration | \(O(|\mathcal{S}|^2 |\mathcal{A}|)\) | \(O(|\mathcal{S}|)\) |
| Q-Learning (tabular) | \(O(1)\) | \(O(|\mathcal{S}| |\mathcal{A}|)\) |
| DQN | \(O(\text{forward pass})\) | \(O(|\mathcal{D}|)\) (replay buffer) |
where \(K\) = number of states (HMM), \(n\) = state dimension (Kalman), \(N\) = number of particles.
11 References & Further Reading
Rabiner (1989): A Tutorial on Hidden Markov Models–classic HMM tutorial
Kalman (1960): A New Approach to Linear Filtering and Prediction Problems–original paper
Thrun et al. (2005): Probabilistic Robotics–Kalman, particle filters, SLAM
Sutton & Barto (2018): Reinforcement Learning: An Introduction–comprehensive RL textbook
Mnih et al. (2015): Human-level control through deep RL–DQN paper
Schulman et al. (2017): Proximal Policy Optimization–PPO paper
Murphy (2012): Machine Learning: A Probabilistic Perspective–Ch 17-18 (sequential models)