Chapter 9: Recommendation Systems
11 Introduction to Recommendation Systems
Recommendation systems predict user preferences for items (products, movies, content) and are critical for e-commerce, streaming, social media, and advertising platforms.
11.1 Core Problem Formulation
Given users \(\mathcal{U} = \{u_1, \ldots, u_m\}\), items \(\mathcal{I} = \{i_1, \ldots, i_n\}\), and interactions \(\mathcal{R} \subseteq \mathcal{U} \times \mathcal{I}\), predict score \(\hat{r}_{ui}\) representing user \(u\)’s preference for item \(i\).
Core Challenges: Sparsity (99%+ missing entries), cold start (no history for new users/items), scalability (billions of user-item pairs), feedback loops (recommendations bias future interactions).
11.2 Types of Feedback
Explicit: Ratings (1-5 stars). Sparse but high signal. Loss: MSE regression \(\sum_{(u,i)} (r_{ui} - \hat{r}_{ui})^2\).
Implicit: Clicks, views, purchases. Dense but noisy (view \(\neq\) like). Treat as confidence-weighted: \(c_{ui} = 1 + \alpha \cdot (\text{count})\). Use negative sampling (1:4-10 ratio). BPR loss assumes observed \(>\) unobserved.
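To make the pairwise assumption concrete, here is a minimal NumPy sketch of the BPR loss for a single (user, positive item, negative item) triple; the factor vectors are random illustrative stand-ins:

```python
import numpy as np

def bpr_loss(p_u, q_pos, q_neg):
    """BPR: -log sigmoid(score(u, i+) - score(u, i-)).
    Minimizing this pushes observed items above sampled unobserved ones."""
    x = p_u @ q_pos - p_u @ q_neg
    return -np.log(1.0 / (1.0 + np.exp(-x)))

rng = np.random.default_rng(0)
p_u, q_pos, q_neg = rng.normal(size=(3, 8))   # toy user/positive/negative factors
print(bpr_loss(p_u, q_pos, q_neg))            # always > 0; smaller = larger margin
```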
12 Collaborative Filtering
Collaborative filtering leverages patterns in user-item interactions without content features.
12.1 Memory-Based (Nearest Neighbors)
User-Based CF: \[\begin{equation} \hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N(u)} \text{sim}(u,v) \cdot (r_{vi} - \bar{r}_v)}{\sum_{v \in N(u)} |\text{sim}(u,v)|} \end{equation}\]
where \(N(u)\) = top-k similar users, \(\text{sim}(u,v)\) = cosine/Pearson correlation.
Item-Based CF: \[\begin{equation} \hat{r}_{ui} = \frac{\sum_{j \in N(i)} \text{sim}(i,j) \cdot r_{uj}}{\sum_{j \in N(i)} |\text{sim}(i,j)|} \end{equation}\]
When to use which: Item-based is preferred when \(|\mathcal{I}| \ll |\mathcal{U}|\) (e-commerce): item similarities are more stable, can be precomputed offline, and the similarity matrix costs \(O(n^2)\) rather than the \(O(m^2)\) of user-based. Both struggle with sparsity, cold start, and feature generalization.
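A toy NumPy sketch of item-based CF on a small rating matrix (the matrix and the cosine similarity choice are illustrative):

```python
import numpy as np

def cosine_sim(A):
    """Pairwise cosine similarity between columns (items) of a rating matrix."""
    norms = np.linalg.norm(A, axis=0, keepdims=True)
    norms[norms == 0] = 1.0            # avoid division by zero for unrated items
    X = A / norms
    return X.T @ X

def predict_item_based(ratings, u, i, k=2):
    """Weighted average of user u's ratings on the k items most similar to i."""
    sim = cosine_sim(ratings)
    rated = np.where(ratings[u] > 0)[0]          # items user u has rated
    rated = rated[rated != i]
    if len(rated) == 0:
        return 0.0
    neigh = rated[np.argsort(-sim[i, rated])][:k]  # top-k neighbours of item i
    w = sim[i, neigh]
    denom = np.abs(w).sum()
    return float(w @ ratings[u, neigh] / denom) if denom > 0 else 0.0

# Toy matrix: rows = users, cols = items, 0 = unrated
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 4, 4]], dtype=float)
print(predict_item_based(R, u=1, i=1))   # predict user 1's rating of item 1
```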
12.2 Matrix Factorization
Decompose sparse user-item matrix \(R \in \mathbb{R}^{m \times n}\) into low-rank factors: \[\begin{equation} R \approx P Q^T, \quad \hat{r}_{ui} = \mathbf{p}_u^T \mathbf{q}_i \end{equation}\]
where \(P \in \mathbb{R}^{m \times k}\) (user embeddings), \(Q \in \mathbb{R}^{n \times k}\) (item embeddings), \(k \ll \min(m, n)\).
Biased MF (often called SVD in the Netflix Prize literature): adds global mean and bias terms: \(\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{p}_u^T \mathbf{q}_i\). SVD++ further incorporates implicit feedback: \(\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{q}_i^T \big( \mathbf{p}_u + |N(u)|^{-1/2} \sum_{j \in N(u)} \mathbf{y}_j \big)\), where \(N(u)\) is the set of items user \(u\) interacted with.
12.2.1 Alternating Least Squares (ALS)
\[\begin{equation} \min_{P, Q} \sum_{(u,i) \in \mathcal{R}} (r_{ui} - \mathbf{p}_u^T \mathbf{q}_i)^2 + \lambda (||P||^2 + ||Q||^2) \end{equation}\]
Algorithm: (1) Fix \(Q\), solve \(\mathbf{p}_u = (Q^T Q + \lambda I)^{-1} Q^T \mathbf{r}_u\); (2) Fix \(P\), solve \(\mathbf{q}_i\); (3) Repeat.
Why ALS over SGD: Embarrassingly parallel (all \(\mathbf{q}_i\) are independent with \(P\) fixed), which is critical for Spark/distributed training. Handles confidence-weighted implicit feedback naturally. Closed-form updates (no learning rate to tune). SGD is faster per iteration but inherently sequential.
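The alternating closed-form update can be sketched in dense NumPy (for exposition only; production implementations such as Spark's operate on sparse, confidence-weighted data):

```python
import numpy as np

def als_step(R, Q, lam):
    """One half-iteration of ALS: with Q fixed, solve each p_u in closed form."""
    k = Q.shape[1]
    P = np.zeros((R.shape[0], k))
    for u in range(R.shape[0]):
        mask = R[u] > 0                       # only observed entries contribute
        Qu = Q[mask]
        A = Qu.T @ Qu + lam * np.eye(k)       # (Q^T Q + lambda I)
        b = Qu.T @ R[u, mask]                 # Q^T r_u
        P[u] = np.linalg.solve(A, b)          # closed-form least-squares solution
    return P

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0], [4, 0, 1], [0, 2, 5]], dtype=float)
k, lam = 2, 0.1
P = rng.normal(size=(3, k))
Q = rng.normal(size=(3, k))
for _ in range(20):                           # alternate: fix Q solve P, fix P solve Q
    P = als_step(R, Q, lam)
    Q = als_step(R.T, P, lam)
err = ((R - P @ Q.T)[R > 0] ** 2).sum()       # squared error on observed entries
```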
Non-Negative MF (NMF): Constraint \(P, Q \geq 0\) gives interpretable, additive factors.
13 Deep Learning for Recommendations
13.1 Neural Collaborative Filtering (NCF)
Replace inner product with neural network: \[\begin{equation} \hat{r}_{ui} = \text{MLP}([\mathbf{p}_u || \mathbf{q}_i]) \end{equation}\]
Architecture: Embedding layer → concatenate user/item → MLP with ReLU → sigmoid output. Loss: BCE with negative sampling.
PyTorch NCF:
```python
import torch
import torch.nn as nn

class NCF(nn.Module):
    def __init__(self, num_users, num_items, embed_dim=64, layers=(128, 64, 32)):
        super().__init__()
        self.user_embed = nn.Embedding(num_users, embed_dim)
        self.item_embed = nn.Embedding(num_items, embed_dim)
        mlp = []
        input_dim = 2 * embed_dim
        for hidden_dim in layers:
            mlp.extend([nn.Linear(input_dim, hidden_dim), nn.ReLU()])
            input_dim = hidden_dim
        mlp.append(nn.Linear(input_dim, 1))
        self.mlp = nn.Sequential(*mlp)

    def forward(self, user_ids, item_ids):
        # Concatenate user and item embeddings, then score with the MLP
        x = torch.cat([self.user_embed(user_ids), self.item_embed(item_ids)], dim=-1)
        return torch.sigmoid(self.mlp(x))
```
13.2 Wide & Deep Learning
Combines memorization (wide) and generalization (deep): \[\begin{equation} \hat{y} = \sigma(\mathbf{w}_{\text{wide}}^T [\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{\text{deep}}^T a^{(l_f)} + b) \end{equation}\]
Wide: Linear model on cross-product features (e.g., AND(country=US, category=electronics)). Memorizes specific rules.
Deep: MLP on embeddings. Generalizes to unseen combinations.
Why combine: Wide captures "users who installed X always click Y" that deep might smooth over. Deep prevents over-specialization. Google Play: significant lift from combining.
13.3 DeepFM
Replaces manual Wide features with Factorization Machines: \[\begin{equation} \hat{y} = \sigma\left(w_0 + \sum_i w_i x_i + \sum_i \sum_{j>i} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j + \text{DNN}(\mathbf{x})\right) \end{equation}\]
Key insight: FM learns all pairwise interactions via embeddings \(\mathbf{v}_i\), with no manual feature engineering. Efficient: \(O(kN)\) via the identity \(\sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2}(\|\sum_i \mathbf{v}_i x_i\|^2 - \sum_i \|\mathbf{v}_i x_i\|^2)\). Shares embeddings between the FM and DNN components.
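The identity is easy to check numerically; in this standalone NumPy sketch the rows of `V` play the role of the factor vectors \(\mathbf{v}_i\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 4
V = rng.normal(size=(n, k))     # one k-dim factor vector per feature
x = rng.normal(size=n)          # feature values (only non-zeros matter)

# Naive O(n^2 k): explicit sum over all pairs i < j
naive = sum(V[i] @ V[j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))

# FM trick, O(nk): 1/2 (||sum_i v_i x_i||^2 - sum_i ||v_i x_i||^2)
s = V.T @ x                                        # sum_i v_i x_i
fast = 0.5 * (s @ s - ((V * x[:, None]) ** 2).sum())

print(np.isclose(naive, fast))   # the two computations agree
```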
13.4 Deep & Cross Network (DCN)
Explicitly models feature crosses at each layer: \[\begin{equation} \mathbf{x}_{l+1} = \mathbf{x}_0 \mathbf{x}_l^T \mathbf{w}_l + \mathbf{b}_l + \mathbf{x}_l \end{equation}\]
Bounded-degree polynomial interactions (degree = layers). \(O(d)\) params per layer. Interpretable crosses.
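A minimal PyTorch sketch of one cross layer (dimensions are illustrative); note that \(\mathbf{x}_0 \mathbf{x}_l^T \mathbf{w}_l\) is rank-one, so computing the scalar \(\mathbf{x}_l^T \mathbf{w}_l\) first keeps the layer at \(O(d)\):

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """One DCN cross layer: x_{l+1} = x0 (x_l^T w_l) + b_l + x_l."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, x0, xl):
        # (batch,) scalar per example, then broadcast against x0
        return x0 * (xl @ self.w).unsqueeze(-1) + self.b + xl

torch.manual_seed(0)
x0 = torch.randn(32, 16)          # batch of 32 input feature vectors
layers = [CrossLayer(16) for _ in range(3)]
x = x0
for layer in layers:              # interaction degree grows with depth
    x = layer(x0, x)
print(x.shape)                    # torch.Size([32, 16])
```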
14 Two-Tower Models
Decouple user and item encoders for efficient large-scale retrieval: \[\begin{align} \mathbf{u} & = f_{\text{user}}(\text{user features}; \theta_u), \quad \mathbf{v} = f_{\text{item}}(\text{item features}; \theta_i) \\ \text{score}(u, i) & = \mathbf{u}^T \mathbf{v} \end{align}\]
Training: Sampled softmax / contrastive loss: \[\begin{equation} \mathcal{L} = -\log \frac{\exp(\mathbf{u}^T \mathbf{v}_+)}{\exp(\mathbf{u}^T \mathbf{v}_+) + \sum_{j \in \text{neg}} \exp(\mathbf{u}^T \mathbf{v}_j)} \end{equation}\]
Why two-tower: Precompute item embeddings offline → ANN search (FAISS, ScaNN) at serving: \(O(\log n)\) vs \(O(n)\). Trade-off: cannot model complex user-item interactions (dot product limited). Solution: two-tower for retrieval (millions→hundreds), then re-rank with DeepFM/DLRM.
DSSM: Two-tower with character n-gram hashing for text (search ranking, QA, ad relevance).
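A common way to implement the contrastive loss is with in-batch negatives, where each item in the batch serves as a negative for every other user; a minimal PyTorch sketch (tower outputs are random stand-ins here):

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb, item_emb):
    """Sampled softmax with in-batch negatives: for user b, item b is the
    positive and every other item in the batch acts as a negative."""
    logits = user_emb @ item_emb.T                # (B, B) all user-item scores
    labels = torch.arange(user_emb.size(0))       # diagonal entries are positives
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
B, d = 8, 32
u = F.normalize(torch.randn(B, d), dim=-1)        # user tower outputs
v = F.normalize(torch.randn(B, d), dim=-1)        # item tower outputs
loss = in_batch_softmax_loss(u, v)
```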
15 DLRM (Deep Learning Recommendation Model)
Facebook’s production CTR model:
Architecture:
Bottom MLP: Dense features → \(\mathbf{z}_{\text{dense}} \in \mathbb{R}^d\)
Embeddings: Sparse/categorical → \(\mathbf{e}_i \in \mathbb{R}^d\) per feature
Interaction: All pairwise dot products \(\langle \mathbf{e}_i, \mathbf{e}_j \rangle\) among the \(k\) sparse embeddings and the dense vector, \(\binom{k+1}{2}\) pairs in total
Top MLP: Concatenate interactions → \(\hat{y} = \sigma(\text{MLP}([\mathbf{p}_{12}, \ldots, \mathbf{e}_1, \ldots, \mathbf{z}_{\text{dense}}]))\)
DLRM vs DeepFM: DeepFM: FM weighted sum of interactions. DLRM: explicit dot products as features to MLP. DLRM scales better (model parallelism: embedding tables across GPUs, data parallelism for MLPs, all-to-all for lookups).
Scaling challenges: Embedding tables \(O(10^9)\) params (10s-100s GB), billions of daily examples, skewed access (popular items). Solutions: model parallelism, hard negative sampling, FP16 embeddings, quantization-aware training.
16 Ranking and Evaluation
16.1 Ranking Metrics
Precision@K: Fraction of top-K recommendations that are relevant: \[\begin{equation} \text{P@K} = \frac{|\{\text{relevant items in top-K}\}|}{K} \end{equation}\]
High precision means few false positives, but ignores whether all relevant items were found.
Recall@K: Fraction of all relevant items that appear in top-K: \[\begin{equation} \text{R@K} = \frac{|\{\text{relevant items in top-K}\}|}{|\text{all relevant items}|} \end{equation}\]
High recall means comprehensive coverage, but may include many irrelevant items.
Mean Average Precision (MAP): \[\begin{equation} \text{MAP@K} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{\min(m_u, K)} \sum_{k=1}^K \text{P@k} \cdot \mathbb{1}_{\text{rel}}(k) \end{equation}\]
where \(m_u\) = total relevant items for user \(u\), \(\mathbb{1}_{\text{rel}}(k) = 1\) if item at rank \(k\) is relevant (else 0). Emphasizes precision at each position–rewards putting relevant items early.
Normalized Discounted Cumulative Gain (NDCG): \[\begin{equation} \text{DCG@K} = \sum_{i=1}^K \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)}, \quad \text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}} \end{equation}\]
where IDCG = ideal DCG (perfect ranking). Handles graded relevance (e.g., 1-5 stars) and smoothly discounts lower positions. Most common metric for ranking tasks.
When to use: NDCG for graded relevance (star ratings), MAP for binary relevant/not. Both penalize lower ranks, but NDCG uses smoother \(\log_2\) decay.
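Minimal NumPy implementations of these metrics (item ids and relevance grades are toy examples):

```python
import numpy as np

def precision_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & relevant) / k

def recall_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked, rels, k):
    """rels maps item -> graded relevance; unlisted items have relevance 0.
    Position r (0-indexed) is discounted by log2(r + 2) = log2(rank + 1)."""
    dcg = sum((2 ** rels.get(i, 0) - 1) / np.log2(r + 2)
              for r, i in enumerate(ranked[:k]))
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum((2 ** g - 1) / np.log2(r + 2) for r, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["a", "b", "c", "d"]                  # system's ranking
relevant = {"a", "c", "e"}                     # ground-truth relevant items
print(precision_at_k(ranked, relevant, 4))     # 2/4 = 0.5
print(recall_at_k(ranked, relevant, 4))        # 2/3
print(ndcg_at_k(ranked, {"a": 3, "c": 1, "e": 2}, 4))
```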
16.2 Cold Start Problem
New User: No interaction history.
Popularity-based: Show globally trending/top items (safe baseline)
Content features: Use demographics (age, location) to find similar users’ preferences
Onboarding: Ask user to rate seed items or select favorite genres
Hybrid: Combine content-based + CF, weight toward content initially
New Item: No user interactions yet.
Content-based: Recommend to users who liked similar items (by metadata/description)
Exploration: Show to diverse users via multi-armed bandits (\(\varepsilon\)-greedy, Thompson sampling)
Transfer learning: Use pre-trained encoders (BERT for text, ResNet for images)
16.3 Sequential Recommendations
Model temporal dynamics: predict next item based on user’s interaction sequence.
RNN-based (GRU4Rec): \[\begin{equation} \mathbf{h}_t = \text{GRU}(\mathbf{h}_{t-1}, \mathbf{e}_{i_{t-1}}), \quad \hat{\mathbf{y}}_t = \text{softmax}(W \mathbf{h}_t) \end{equation}\]
Updates hidden state with each item, predicts next item distribution.
Transformer-based:
BERT4Rec: Masked item prediction (randomly mask items in sequence, predict them)
SASRec: Self-attention with causal masking (attends only to past items)
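The GRU4Rec recurrence above can be sketched as a small PyTorch module (dimensions are illustrative; real implementations add session-parallel batching and pairwise ranking losses):

```python
import torch
import torch.nn as nn

class GRU4Rec(nn.Module):
    def __init__(self, num_items, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.item_embed = nn.Embedding(num_items, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_items)   # next-item logits

    def forward(self, item_seq):
        # item_seq: (batch, seq_len) item ids; h_t summarizes the prefix up to t
        h, _ = self.gru(self.item_embed(item_seq))
        return self.out(h)                            # (batch, seq_len, num_items)

torch.manual_seed(0)
model = GRU4Rec(num_items=1000)
seq = torch.randint(0, 1000, (4, 10))                 # 4 sessions of 10 items
logits = model(seq)                                   # next-item scores at each step
print(logits.shape)                                   # torch.Size([4, 10, 1000])
```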
17 Advanced Topics
17.1 Multi-Armed Bandits
Balance exploration (trying new items) vs exploitation (recommending known good items).
\(\varepsilon\)-greedy: With probability \(\varepsilon\) explore (random item), else exploit (highest predicted score). Simple, but does not adapt exploration to uncertainty.
Thompson Sampling: Bayesian approach–maintain posterior over item quality: \[\begin{equation} \theta_i \sim \text{Beta}(\alpha_i, \beta_i) \end{equation}\]
Sample from each posterior, recommend item with highest sample. Update: \(\alpha_i \leftarrow \alpha_i + \text{click}\), \(\beta_i \leftarrow \beta_i + (1-\text{click})\). Naturally balances exploration/exploitation via uncertainty.
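The Beta-Bernoulli loop can be simulated in a few lines of NumPy (the true click rates are assumptions made for the simulation):

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = np.array([0.02, 0.05, 0.10])      # unknown click rates of 3 items
alpha = np.ones(3)                           # Beta(1, 1) uniform priors
beta = np.ones(3)

for _ in range(5000):
    theta = rng.beta(alpha, beta)            # one sample per item's posterior
    i = int(np.argmax(theta))                # recommend the best sampled item
    click = rng.random() < true_ctr[i]       # simulate user feedback
    alpha[i] += click                        # posterior update
    beta[i] += 1 - click

pulls = alpha + beta - 2                     # times each item was shown
print(int(np.argmax(pulls)))                 # exploitation concentrates on the best item
```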
Contextual Bandits (LinUCB): Linear model with upper confidence bound: \[\begin{equation} \text{score}(a) = \mathbf{w}^T \mathbf{x}_a + \alpha \sqrt{\mathbf{x}_a^T A^{-1} \mathbf{x}_a} \end{equation}\]
UCB term encourages exploring items with high uncertainty. Generalizes to contexts (user features, time of day).
17.2 Debiasing
Selection Bias: Users only see items the system showed–logged data is biased by previous model.
Solutions:
Inverse propensity scoring: Weight loss by \(1/P(\text{item shown})\) to correct for exposure
Position bias correction: Items at top clicked more regardless of quality (unbiased learning-to-rank)
Randomized exploration: Show random items to small fraction of users (collect unbiased labels)
Popularity Bias: Popular items over-recommended (rich get richer), hurts diversity and long-tail discovery.
Solutions:
Diversity re-ranking: Penalize items similar to already-shown, boost under-represented categories
Calibration: Match recommendation distribution to user’s historical genre/category preferences
18 Production System Design
18.1 Two-Stage Architecture
Production systems use multi-stage pipelines to balance accuracy and latency at scale.
Stage 1: Candidate Generation (Retrieval)
Goal: Narrow millions of items to hundreds (recall-focused)
Methods: Two-tower models, ANN search (FAISS), matrix factorization
Latency: \(<\) 10ms
Stage 2: Ranking
Goal: Score candidates with full features (precision-focused)
Methods: DLRM, DeepFM, Wide & Deep
Latency: \(<\) 50ms for top-100 candidates
Stage 3: Re-ranking (optional)
Business logic: diversity, freshness, A/B test buckets
Calibration: adjust scores to match user preferences
Example System (100M users × 1M items):
Offline (daily):
Train two-tower on previous day’s interactions (1:10 negative sampling)
Export item embeddings \(\mathbf{v}_1, \ldots, \mathbf{v}_{1M}\)
Build FAISS index: IndexIVFPQ (quantized, partitioned for sub-linear search)
Train ranking model (DLRM) on full features
Online (per request, \(<\) 100ms):
User encoding (5ms): \(\mathbf{u} = f_{\text{user}}(\text{features})\) via GPU
Retrieval (10ms): FAISS search for top-500 by \(\mathbf{u}^T \mathbf{v}_i\)
Feature hydration (20ms): Fetch user/item features from Redis cache
Ranking (30ms): Score top-500 with DLRM (batched GPU inference)
Re-ranking (5ms): Apply business rules, return top-10
Infrastructure:
User sharding across servers (partition by user ID)
Redis caching for features (LRU for popular items)
TensorFlow Serving / TorchServe for batched model inference
A/B testing framework to route traffic to model versions
18.2 Negative Sampling Strategies
Random negatives: Sample unobserved items uniformly. Fast but weak signal.
Hard negatives: More informative but expensive to collect:
Items user viewed but didn’t click (in-session negatives)
Items similar to positive but not interacted
High-predicted score items user actually rejected
Popularity-based: Sample proportional to \(P(i)^{0.75}\) (like word2vec negative sampling).
Common Pitfall: Random-only negatives → model learns popularity, not preferences. Always mix: 70% random + 30% hard negatives.
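A NumPy sketch of mixed sampling, combining popularity-smoothed random negatives with a hard-negative pool (the 70/30 split, counts, and pool are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_negatives(item_counts, positives, hard_pool, n_neg, hard_frac=0.3):
    """Mix popularity^0.75 random negatives with a pool of hard negatives."""
    n_items = len(item_counts)
    p = item_counts ** 0.75                  # smooth the popularity skew (word2vec-style)
    p[list(positives)] = 0.0                 # never sample the user's positives
    p /= p.sum()
    n_hard = int(n_neg * hard_frac)
    hard = rng.choice(hard_pool, size=min(n_hard, len(hard_pool)), replace=False)
    rand = rng.choice(n_items, size=n_neg - len(hard), p=p)
    return np.concatenate([hard, rand])

counts = np.array([100.0, 50.0, 10.0, 5.0, 1.0, 1.0])   # item popularity
negs = sample_negatives(counts, positives={0}, hard_pool=np.array([2]), n_neg=10)
print(negs)
```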
18.3 Online vs Offline Metrics
Offline metrics (logged data):
AUC, logloss, NDCG@K on held-out test set
Fast iteration, but subject to biases
Online metrics (A/B test):
CTR (click-through rate), conversion rate, watch time
User engagement: session length, return rate
Business: revenue, retention
Why they diverge:
Distribution shift: Test set from past, deploy to future users
Selection bias: Logged data only contains previously shown items
Feedback loops: Recommendations change user behavior
Best practice: Use offline metrics for fast iteration, validate with online A/B before full rollout.
19 Interview Questions
Q1: User-based vs item-based CF–when to use each?
A: Item-based when \(|\mathcal{I}| \ll |\mathcal{U}|\) (e-commerce): item similarities stable, precomputable offline, \(O(n^2)\) vs \(O(m^2)\). User-based for small user bases (enterprise B2B). Both fail on cold start and lack generalization.
Q2: Why ALS instead of SGD for matrix factorization?
A: Parallelization–with \(P\) fixed, all \(\mathbf{q}_i\) solved independently (critical for Spark). Handles confidence-weighted implicit feedback naturally. Closed-form (no LR tuning). SGD faster per-iteration but sequential.
Q3: How does DeepFM handle sparse features efficiently?
A: Embedding per categorical value. FM identity: \(\sum_{i<j}\langle\mathbf{v}_i,\mathbf{v}_j\rangle x_ix_j = \frac{1}{2}(\|\sum_i\mathbf{v}_ix_i\|^2 - \sum_i\|\mathbf{v}_ix_i\|^2)\) → \(O(kN)\) for \(N\) non-zero features.
Q4: Why two-tower instead of single model? Trade-offs?
A: Precompute item embeddings → ANN retrieval \(O(\log n)\) vs \(O(n)\). Trade-off: dot product limits interaction modeling. Solution: two-tower for retrieval (M→100s), re-rank with DeepFM/DLRM.
Q5: DLRM vs DeepFM–which scales better?
A: DLRM scales better: model parallelism (embedding tables across GPUs), data parallelism (MLPs), all-to-all communication. DeepFM simpler for single-GPU scale.
Q6: How to handle implicit feedback vs explicit?
A: Confidence-weighted: \(c_{ui} = 1 + \alpha \cdot \text{count}\). Negative sampling (1:4-10 ratio). BPR loss assumes observed \(>\) unobserved. Absence \(\neq\) dislike, just lack of exposure.
Q7: NDCG vs MAP–when to use each?
A: NDCG for graded relevance (1-5 stars), MAP for binary (relevant/not). Both penalize lower ranks. NDCG smoother decay (\(\log_2\)), more common in practice.
Q8: Design a real-time recommender for 100M users × 1M items.
A: Offline: train two-tower, export embeddings, build FAISS IndexIVFPQ, train DLRM. Online (\(<\)100ms): user encoding (5ms) → FAISS top-500 (10ms) → Redis feature hydration (20ms) → DLRM ranking (30ms) → business rules (5ms). Scale: user sharding, Redis caching, batched inference.
Q9: Cold start solutions for new streaming service?
A: Launch: popularity + onboarding (genre selection) + content features (metadata + pretrained embeddings). Growth: blend CF as interactions accumulate + Thompson sampling exploration. Mature: two-tower retrieval + sequence models (what user just watched).
Q10: Why do offline and online metrics diverge?
A: Distribution shift (test=past, deploy=future), selection bias (only previously shown items), feedback loops (recommendations change behavior). Use offline for fast iteration, A/B for validation.
19.1 Quick Reference
| Method | Pros | Cons | Use Case |
|---|---|---|---|
| Item-based CF | Stable, offline compute | No generalization, cold start | E-commerce |
| Matrix Factor. | Scalable, generalizes | Linear interactions only | Baseline CF |
| NCF | Non-linear interactions | Slower than MF | When MF plateaus |
| DeepFM | Auto feature crossing | More complex | CTR with many features |
| DLRM | Billion-scale | Complex distributed train | Ads/feed at scale |
| Two-tower | Fast ANN retrieval | Limited interactions | Candidate generation |
19.2 Key Formulas
MF: \(\hat{r}_{ui} = \mathbf{p}_u^T \mathbf{q}_i\) ALS: \(\mathbf{p}_u = (Q^TQ + \lambda I)^{-1}Q^T\mathbf{r}_u\) Two-tower: \(\mathcal{L} = -\log\frac{e^{\mathbf{u}^T\mathbf{v}_+}}{e^{\mathbf{u}^T\mathbf{v}_+} + \sum_j e^{\mathbf{u}^T\mathbf{v}_j}}\)
NDCG: \(\text{DCG@K} = \sum_{i=1}^K \frac{2^{\text{rel}_i}-1}{\log_2(i+1)}\) FM: \(\sum_{i<j}\langle\mathbf{v}_i,\mathbf{v}_j\rangle x_i x_j = \frac{1}{2}(\|\sum_i\mathbf{v}_i x_i\|^2 - \sum_i\|\mathbf{v}_i x_i\|^2)\)