Chapter 9: Recommendation Systems
11 Introduction to Recommendation Systems
Recommendation systems predict user preferences for items (products, movies, content) and are critical for e-commerce, streaming, social media, and advertising platforms.
11.1 Core Problem Formulation
Given users \(\mathcal{U} = \{u_1, \ldots, u_m\}\), items \(\mathcal{I} = \{i_1, \ldots, i_n\}\), and interactions \(\mathcal{R} \subseteq \mathcal{U} \times \mathcal{I}\), predict score \(\hat{r}_{ui}\) representing user \(u\)’s preference for item \(i\).
Core Challenges: Sparsity (99%+ missing entries), cold start (no history for new users/items), scalability (billions of user-item pairs), feedback loops (recommendations bias future interactions).
11.2 Types of Feedback
Explicit: Ratings (1-5 stars). Sparse but high signal. Loss: MSE regression \(\sum_{(u,i)} (r_{ui} - \hat{r}_{ui})^2\).
Implicit: Clicks, views, purchases. Dense but noisy (view \(\neq\) like). Treat as confidence-weighted: \(c_{ui} = 1 + \alpha \cdot (\text{count})\). Use negative sampling (1:4-10 ratio). BPR loss assumes observed \(>\) unobserved.
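To make the pairwise assumption concrete, here is a minimal NumPy sketch of the BPR loss for a single (user, positive item, negative item) triple; the factor vectors are random illustrative stand-ins:

```python
import numpy as np

def bpr_loss(p_u, q_pos, q_neg):
    """BPR: -log sigmoid(score(u, i+) - score(u, i-)).
    Minimizing this pushes observed items above sampled unobserved ones."""
    x = p_u @ q_pos - p_u @ q_neg
    return -np.log(1.0 / (1.0 + np.exp(-x)))

rng = np.random.default_rng(0)
p_u, q_pos, q_neg = rng.normal(size=(3, 8))   # toy user/positive/negative factors
print(bpr_loss(p_u, q_pos, q_neg))            # always > 0; smaller = larger margin
```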
12 Collaborative Filtering
Collaborative filtering leverages patterns in user-item interactions without content features.
12.1 Memory-Based (Nearest Neighbors)
User-Based CF: \[\begin{equation} \hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N(u)} \text{sim}(u,v) \cdot (r_{vi} - \bar{r}_v)}{\sum_{v \in N(u)} |\text{sim}(u,v)|} \end{equation}\]
where \(N(u)\) = top-k similar users, \(\text{sim}(u,v)\) = cosine/Pearson correlation.
Item-Based CF: \[\begin{equation} \hat{r}_{ui} = \frac{\sum_{j \in N(i)} \text{sim}(i,j) \cdot r_{uj}}{\sum_{j \in N(i)} |\text{sim}(i,j)|} \end{equation}\]
When to use which: Item-based is preferred when \(|\mathcal{I}| \ll |\mathcal{U}|\) (e-commerce): item similarities are more stable, can be precomputed offline, and the similarity matrix costs \(O(n^2)\) rather than the \(O(m^2)\) of user-based. Both struggle with sparsity, cold start, and feature generalization.
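A toy NumPy sketch of item-based CF on a small rating matrix (the matrix and the cosine similarity choice are illustrative):

```python
import numpy as np

def cosine_sim(A):
    """Pairwise cosine similarity between columns (items) of a rating matrix."""
    norms = np.linalg.norm(A, axis=0, keepdims=True)
    norms[norms == 0] = 1.0            # avoid division by zero for unrated items
    X = A / norms
    return X.T @ X

def predict_item_based(ratings, u, i, k=2):
    """Weighted average of user u's ratings on the k items most similar to i."""
    sim = cosine_sim(ratings)
    rated = np.where(ratings[u] > 0)[0]          # items user u has rated
    rated = rated[rated != i]
    if len(rated) == 0:
        return 0.0
    neigh = rated[np.argsort(-sim[i, rated])][:k]  # top-k neighbours of item i
    w = sim[i, neigh]
    denom = np.abs(w).sum()
    return float(w @ ratings[u, neigh] / denom) if denom > 0 else 0.0

# Toy matrix: rows = users, cols = items, 0 = unrated
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 4, 4]], dtype=float)
print(predict_item_based(R, u=1, i=1))   # predict user 1's rating of item 1
```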
12.2 Matrix Factorization
Decompose sparse user-item matrix \(R \in \mathbb{R}^{m \times n}\) into low-rank factors: \[\begin{equation} R \approx P Q^T, \quad \hat{r}_{ui} = \mathbf{p}_u^T \mathbf{q}_i \end{equation}\]
where \(P \in \mathbb{R}^{m \times k}\) (user embeddings), \(Q \in \mathbb{R}^{n \times k}\) (item embeddings), \(k \ll \min(m, n)\).
Biased MF (often called SVD in the Netflix Prize literature): adds global mean and bias terms: \(\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{p}_u^T \mathbf{q}_i\). SVD++ further incorporates implicit feedback: \(\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{q}_i^T \big( \mathbf{p}_u + |N(u)|^{-1/2} \sum_{j \in N(u)} \mathbf{y}_j \big)\), where \(N(u)\) is the set of items user \(u\) interacted with.
12.2.1 Alternating Least Squares (ALS)
\[\begin{equation} \min_{P, Q} \sum_{(u,i) \in \mathcal{R}} (r_{ui} - \mathbf{p}_u^T \mathbf{q}_i)^2 + \lambda (||P||^2 + ||Q||^2) \end{equation}\]
Algorithm: (1) Fix \(Q\), solve \(\mathbf{p}_u = (Q^T Q + \lambda I)^{-1} Q^T \mathbf{r}_u\); (2) Fix \(P\), solve \(\mathbf{q}_i\); (3) Repeat.
Why ALS over SGD: Embarrassingly parallel (all \(\mathbf{q}_i\) are independent with \(P\) fixed), which is critical for Spark/distributed training. Handles confidence-weighted implicit feedback naturally. Closed-form updates (no learning rate to tune). SGD is faster per iteration but inherently sequential.
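The alternating closed-form update can be sketched in dense NumPy (for exposition only; production implementations such as Spark's operate on sparse, confidence-weighted data):

```python
import numpy as np

def als_step(R, Q, lam):
    """One half-iteration of ALS: with Q fixed, solve each p_u in closed form."""
    k = Q.shape[1]
    P = np.zeros((R.shape[0], k))
    for u in range(R.shape[0]):
        mask = R[u] > 0                       # only observed entries contribute
        Qu = Q[mask]
        A = Qu.T @ Qu + lam * np.eye(k)       # (Q^T Q + lambda I)
        b = Qu.T @ R[u, mask]                 # Q^T r_u
        P[u] = np.linalg.solve(A, b)          # closed-form least-squares solution
    return P

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0], [4, 0, 1], [0, 2, 5]], dtype=float)
k, lam = 2, 0.1
P = rng.normal(size=(3, k))
Q = rng.normal(size=(3, k))
for _ in range(20):                           # alternate: fix Q solve P, fix P solve Q
    P = als_step(R, Q, lam)
    Q = als_step(R.T, P, lam)
err = ((R - P @ Q.T)[R > 0] ** 2).sum()       # squared error on observed entries
```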
Non-Negative MF (NMF): Constraint \(P, Q \geq 0\) gives interpretable, additive factors.
13 Deep Learning for Recommendations
13.1 Neural Collaborative Filtering (NCF)
Replace inner product with neural network: \[\begin{equation} \hat{r}_{ui} = \text{MLP}([\mathbf{p}_u || \mathbf{q}_i]) \end{equation}\]
Architecture: Embedding layer → concatenate user/item → MLP with ReLU → sigmoid output. Loss: BCE with negative sampling.
PyTorch NCF:
```python
import torch
import torch.nn as nn

class NCF(nn.Module):
    def __init__(self, num_users, num_items, embed_dim=64, layers=(128, 64, 32)):
        super().__init__()
        self.user_embed = nn.Embedding(num_users, embed_dim)
        self.item_embed = nn.Embedding(num_items, embed_dim)
        mlp = []
        input_dim = 2 * embed_dim
        for hidden_dim in layers:
            mlp.extend([nn.Linear(input_dim, hidden_dim), nn.ReLU()])
            input_dim = hidden_dim
        mlp.append(nn.Linear(input_dim, 1))
        self.mlp = nn.Sequential(*mlp)

    def forward(self, user_ids, item_ids):
        # Concatenate user and item embeddings, then score with the MLP
        x = torch.cat([self.user_embed(user_ids), self.item_embed(item_ids)], dim=-1)
        return torch.sigmoid(self.mlp(x))
```
13.2 Wide & Deep Learning
Combines memorization (wide) and generalization (deep): \[\begin{equation} \hat{y} = \sigma(\mathbf{w}_{\text{wide}}^T [\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{\text{deep}}^T a^{(l_f)} + b) \end{equation}\]
Wide: Linear model on cross-product features (e.g., AND(country=US, category=electronics)). Memorizes specific rules.
Deep: MLP on embeddings. Generalizes to unseen combinations.
Why combine: Wide captures "users who installed X always click Y" that deep might smooth over. Deep prevents over-specialization. Google Play: significant lift from combining.
13.3 DeepFM
Replaces manual Wide features with Factorization Machines: \[\begin{equation} \hat{y} = \sigma\left(w_0 + \sum_i w_i x_i + \sum_i \sum_{j>i} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j + \text{DNN}(\mathbf{x})\right) \end{equation}\]
Key insight: FM learns all pairwise interactions via embeddings \(\mathbf{v}_i\), with no manual feature engineering. Efficient: \(O(kN)\) via the identity \(\sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2}(\|\sum_i \mathbf{v}_i x_i\|^2 - \sum_i \|\mathbf{v}_i x_i\|^2)\). Shares embeddings between the FM and DNN components.
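The identity is easy to check numerically; in this standalone NumPy sketch the rows of `V` play the role of the factor vectors \(\mathbf{v}_i\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 4
V = rng.normal(size=(n, k))     # one k-dim factor vector per feature
x = rng.normal(size=n)          # feature values (only non-zeros matter)

# Naive O(n^2 k): explicit sum over all pairs i < j
naive = sum(V[i] @ V[j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))

# FM trick, O(nk): 1/2 (||sum_i v_i x_i||^2 - sum_i ||v_i x_i||^2)
s = V.T @ x                                        # sum_i v_i x_i
fast = 0.5 * (s @ s - ((V * x[:, None]) ** 2).sum())

print(np.isclose(naive, fast))   # the two computations agree
```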
13.4 Deep & Cross Network (DCN)
Explicitly models feature crosses at each layer: \[\begin{equation} \mathbf{x}_{l+1} = \mathbf{x}_0 \mathbf{x}_l^T \mathbf{w}_l + \mathbf{b}_l + \mathbf{x}_l \end{equation}\]
Bounded-degree polynomial interactions (degree = layers). \(O(d)\) params per layer. Interpretable crosses.
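A minimal PyTorch sketch of one cross layer (dimensions are illustrative); note that \(\mathbf{x}_0 \mathbf{x}_l^T \mathbf{w}_l\) is rank-one, so computing the scalar \(\mathbf{x}_l^T \mathbf{w}_l\) first keeps the layer at \(O(d)\):

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """One DCN cross layer: x_{l+1} = x0 (x_l^T w_l) + b_l + x_l."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, x0, xl):
        # (batch,) scalar per example, then broadcast against x0
        return x0 * (xl @ self.w).unsqueeze(-1) + self.b + xl

torch.manual_seed(0)
x0 = torch.randn(32, 16)          # batch of 32 input feature vectors
layers = [CrossLayer(16) for _ in range(3)]
x = x0
for layer in layers:              # interaction degree grows with depth
    x = layer(x0, x)
print(x.shape)                    # torch.Size([32, 16])
```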
14 Two-Tower Models
Decouple user and item encoders for efficient large-scale retrieval: \[\begin{align} \mathbf{u} & = f_{\text{user}}(\text{user features}; \theta_u), \quad \mathbf{v} = f_{\text{item}}(\text{item features}; \theta_i) \\ \text{score}(u, i) & = \mathbf{u}^T \mathbf{v} \end{align}\]
Training: Sampled softmax / contrastive loss: \[\begin{equation} \mathcal{L} = -\log \frac{\exp(\mathbf{u}^T \mathbf{v}_+)}{\exp(\mathbf{u}^T \mathbf{v}_+) + \sum_{j \in \text{neg}} \exp(\mathbf{u}^T \mathbf{v}_j)} \end{equation}\]
Why two-tower: Precompute item embeddings offline → ANN search (FAISS, ScaNN) at serving: \(O(\log n)\) vs \(O(n)\). Trade-off: cannot model complex user-item interactions (dot product limited). Solution: two-tower for retrieval (millions→hundreds), then re-rank with DeepFM/DLRM.
DSSM: Two-tower with character n-gram hashing for text (search ranking, QA, ad relevance).
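A common way to implement the contrastive loss is with in-batch negatives, where each item in the batch serves as a negative for every other user; a minimal PyTorch sketch (tower outputs are random stand-ins here):

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb, item_emb):
    """Sampled softmax with in-batch negatives: for user b, item b is the
    positive and every other item in the batch acts as a negative."""
    logits = user_emb @ item_emb.T                # (B, B) all user-item scores
    labels = torch.arange(user_emb.size(0))       # diagonal entries are positives
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
B, d = 8, 32
u = F.normalize(torch.randn(B, d), dim=-1)        # user tower outputs
v = F.normalize(torch.randn(B, d), dim=-1)        # item tower outputs
loss = in_batch_softmax_loss(u, v)
```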
15 DLRM (Deep Learning Recommendation Model)
Facebook’s production CTR model:
Architecture:
Bottom MLP: Dense features → \(\mathbf{z}_{\text{dense}} \in \mathbb{R}^d\)
Embeddings: Sparse/categorical → \(\mathbf{e}_i \in \mathbb{R}^d\) per feature
Interaction: All pairwise dot products \(\langle \mathbf{e}_i, \mathbf{e}_j \rangle\) among the \(k\) sparse embeddings and the dense vector, \(\binom{k+1}{2}\) pairs in total
Top MLP: Concatenate interactions → \(\hat{y} = \sigma(\text{MLP}([\mathbf{p}_{12}, \ldots, \mathbf{e}_1, \ldots, \mathbf{z}_{\text{dense}}]))\)
DLRM vs DeepFM: DeepFM: FM weighted sum of interactions. DLRM: explicit dot products as features to MLP. DLRM scales better (model parallelism: embedding tables across GPUs, data parallelism for MLPs, all-to-all for lookups).
Scaling challenges: Embedding tables \(O(10^9)\) params (10s-100s GB), billions of daily examples, skewed access (popular items). Solutions: model parallelism, hard negative sampling, FP16 embeddings, quantization-aware training.
16 Ranking and Evaluation
16.1 Ranking Metrics
Precision@K: Fraction of top-K recommendations that are relevant: \[\begin{equation} \text{P@K} = \frac{|\{\text{relevant items in top-K}\}|}{K} \end{equation}\]
High precision means few false positives, but ignores whether all relevant items were found.
Recall@K: Fraction of all relevant items that appear in top-K: \[\begin{equation} \text{R@K} = \frac{|\{\text{relevant items in top-K}\}|}{|\text{all relevant items}|} \end{equation}\]
High recall means comprehensive coverage, but may include many irrelevant items.
Mean Average Precision (MAP): \[\begin{equation} \text{MAP@K} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{\min(m_u, K)} \sum_{k=1}^K \text{P@k} \cdot \mathbb{1}_{\text{rel}}(k) \end{equation}\]
where \(m_u\) = total relevant items for user \(u\), \(\mathbb{1}_{\text{rel}}(k) = 1\) if item at rank \(k\) is relevant (else 0). Emphasizes precision at each position–rewards putting relevant items early.
Normalized Discounted Cumulative Gain (NDCG): \[\begin{equation} \text{DCG@K} = \sum_{i=1}^K \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)}, \quad \text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}} \end{equation}\]
where IDCG = ideal DCG (perfect ranking). Handles graded relevance (e.g., 1-5 stars) and smoothly discounts lower positions. Most common metric for ranking tasks.
When to use: NDCG for graded relevance (star ratings), MAP for binary relevant/not. Both penalize lower ranks, but NDCG uses smoother \(\log_2\) decay.
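Minimal NumPy implementations of these metrics (item ids and relevance grades are toy examples):

```python
import numpy as np

def precision_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & relevant) / k

def recall_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked, rels, k):
    """rels maps item -> graded relevance; unlisted items have relevance 0.
    Position r (0-indexed) is discounted by log2(r + 2) = log2(rank + 1)."""
    dcg = sum((2 ** rels.get(i, 0) - 1) / np.log2(r + 2)
              for r, i in enumerate(ranked[:k]))
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum((2 ** g - 1) / np.log2(r + 2) for r, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["a", "b", "c", "d"]                  # system's ranking
relevant = {"a", "c", "e"}                     # ground-truth relevant items
print(precision_at_k(ranked, relevant, 4))     # 2/4 = 0.5
print(recall_at_k(ranked, relevant, 4))        # 2/3
print(ndcg_at_k(ranked, {"a": 3, "c": 1, "e": 2}, 4))
```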
16.2 Cold Start Problem
New User: No interaction history.
Popularity-based: Show globally trending/top items (safe baseline)
Content features: Use demographics (age, location) to find similar users’ preferences
Onboarding: Ask user to rate seed items or select favorite genres
Hybrid: Combine content-based + CF, weight toward content initially
New Item: No user interactions yet.
Content-based: Recommend to users who liked similar items (by metadata/description)
Exploration: Show to diverse users via multi-armed bandits (\(\varepsilon\)-greedy, Thompson sampling)
Transfer learning: Use pre-trained encoders (BERT for text, ResNet for images)
16.3 Sequential Recommendations
Model temporal dynamics: predict next item based on user’s interaction sequence.
RNN-based (GRU4Rec): \[\begin{equation} \mathbf{h}_t = \text{GRU}(\mathbf{h}_{t-1}, \mathbf{e}_{i_{t-1}}), \quad \hat{\mathbf{y}}_t = \text{softmax}(W \mathbf{h}_t) \end{equation}\]
Updates hidden state with each item, predicts next item distribution.
Transformer-based:
BERT4Rec: Masked item prediction (randomly mask items in sequence, predict them)
SASRec: Self-attention with causal masking (attends only to past items)
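The GRU4Rec recurrence above can be sketched as a small PyTorch module (dimensions are illustrative; real implementations add session-parallel batching and pairwise ranking losses):

```python
import torch
import torch.nn as nn

class GRU4Rec(nn.Module):
    def __init__(self, num_items, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.item_embed = nn.Embedding(num_items, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_items)   # next-item logits

    def forward(self, item_seq):
        # item_seq: (batch, seq_len) item ids; h_t summarizes the prefix up to t
        h, _ = self.gru(self.item_embed(item_seq))
        return self.out(h)                            # (batch, seq_len, num_items)

torch.manual_seed(0)
model = GRU4Rec(num_items=1000)
seq = torch.randint(0, 1000, (4, 10))                 # 4 sessions of 10 items
logits = model(seq)                                   # next-item scores at each step
print(logits.shape)                                   # torch.Size([4, 10, 1000])
```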
17 Advanced Topics
17.1 Multi-Armed Bandits
Balance exploration (trying new items) vs exploitation (recommending known good items).
\(\varepsilon\)-greedy: With probability \(\varepsilon\) explore (random item), else exploit (highest predicted score). Simple, but does not adapt exploration to uncertainty.
Thompson Sampling: Bayesian approach–maintain posterior over item quality: \[\begin{equation} \theta_i \sim \text{Beta}(\alpha_i, \beta_i) \end{equation}\]
Sample from each posterior, recommend item with highest sample. Update: \(\alpha_i \leftarrow \alpha_i + \text{click}\), \(\beta_i \leftarrow \beta_i + (1-\text{click})\). Naturally balances exploration/exploitation via uncertainty.
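The Beta-Bernoulli loop can be simulated in a few lines of NumPy (the true click rates are assumptions made for the simulation):

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = np.array([0.02, 0.05, 0.10])      # unknown click rates of 3 items
alpha = np.ones(3)                           # Beta(1, 1) uniform priors
beta = np.ones(3)

for _ in range(5000):
    theta = rng.beta(alpha, beta)            # one sample per item's posterior
    i = int(np.argmax(theta))                # recommend the best sampled item
    click = rng.random() < true_ctr[i]       # simulate user feedback
    alpha[i] += click                        # posterior update
    beta[i] += 1 - click

pulls = alpha + beta - 2                     # times each item was shown
print(int(np.argmax(pulls)))                 # exploitation concentrates on the best item
```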
Contextual Bandits (LinUCB): Linear model with upper confidence bound: \[\begin{equation} \text{score}(a) = \mathbf{w}^T \mathbf{x}_a + \alpha \sqrt{\mathbf{x}_a^T A^{-1} \mathbf{x}_a} \end{equation}\]
UCB term encourages exploring items with high uncertainty. Generalizes to contexts (user features, time of day).
17.2 Debiasing
Selection Bias: Users only see items the system showed–logged data is biased by previous model.
Solutions:
Inverse propensity scoring: Weight loss by \(1/P(\text{item shown})\) to correct for exposure
Position bias correction: Items at top clicked more regardless of quality (unbiased learning-to-rank)
Randomized exploration: Show random items to small fraction of users (collect unbiased labels)
Popularity Bias: Popular items over-recommended (rich get richer), hurts diversity and long-tail discovery.
Solutions:
Diversity re-ranking: Penalize items similar to already-shown, boost under-represented categories
Calibration: Match recommendation distribution to user’s historical genre/category preferences
18 Production System Design
18.1 Two-Stage Architecture
Production systems use multi-stage pipelines to balance accuracy and latency at scale.
Stage 1: Candidate Generation (Retrieval)
Goal: Narrow millions of items to hundreds (recall-focused)
Methods: Two-tower models, ANN search (FAISS), matrix factorization
Latency: \(<\) 10ms
Stage 2: Ranking
Goal: Score candidates with full features (precision-focused)
Methods: DLRM, DeepFM, Wide & Deep
Latency: \(<\) 50ms for top-100 candidates
Stage 3: Re-ranking (optional)
Business logic: diversity, freshness, A/B test buckets
Calibration: adjust scores to match user preferences
Example System (100M users × 1M items):
Offline (daily):
Train two-tower on previous day’s interactions (1:10 negative sampling)
Export item embeddings \(\mathbf{v}_1, \ldots, \mathbf{v}_{1M}\)
Build FAISS index: IndexIVFPQ (quantized, partitioned for sub-linear search)
Train ranking model (DLRM) on full features
Online (per request, \(<\) 100ms):
User encoding (5ms): \(\mathbf{u} = f_{\text{user}}(\text{features})\) via GPU
Retrieval (10ms): FAISS search for top-500 by \(\mathbf{u}^T \mathbf{v}_i\)
Feature hydration (20ms): Fetch user/item features from Redis cache
Ranking (30ms): Score top-500 with DLRM (batched GPU inference)
Re-ranking (5ms): Apply business rules, return top-10
Infrastructure:
User sharding across servers (partition by user ID)
Redis caching for features (LRU for popular items)
TensorFlow Serving / TorchServe for batched model inference
A/B testing framework to route traffic to model versions
18.2 Negative Sampling Strategies
Random negatives: Sample unobserved items uniformly. Fast but weak signal.
Hard negatives: More informative but expensive to collect:
Items user viewed but didn’t click (in-session negatives)
Items similar to positive but not interacted
High-predicted score items user actually rejected
Popularity-based: Sample proportional to \(P(i)^{0.75}\) (like word2vec negative sampling).
Common Pitfall: Random-only negatives → model learns popularity, not preferences. Always mix: 70% random + 30% hard negatives.
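A NumPy sketch of mixed sampling, combining popularity-smoothed random negatives with a hard-negative pool (the 70/30 split, counts, and pool are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_negatives(item_counts, positives, hard_pool, n_neg, hard_frac=0.3):
    """Mix popularity^0.75 random negatives with a pool of hard negatives."""
    n_items = len(item_counts)
    p = item_counts ** 0.75                  # smooth the popularity skew (word2vec-style)
    p[list(positives)] = 0.0                 # never sample the user's positives
    p /= p.sum()
    n_hard = int(n_neg * hard_frac)
    hard = rng.choice(hard_pool, size=min(n_hard, len(hard_pool)), replace=False)
    rand = rng.choice(n_items, size=n_neg - len(hard), p=p)
    return np.concatenate([hard, rand])

counts = np.array([100.0, 50.0, 10.0, 5.0, 1.0, 1.0])   # item popularity
negs = sample_negatives(counts, positives={0}, hard_pool=np.array([2]), n_neg=10)
print(negs)
```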
18.3 Online vs Offline Metrics
Offline metrics (logged data):
AUC, logloss, NDCG@K on held-out test set
Fast iteration, but subject to biases
Online metrics (A/B test):
CTR (click-through rate), conversion rate, watch time
User engagement: session length, return rate
Business: revenue, retention
Why they diverge:
Distribution shift: Test set from past, deploy to future users
Selection bias: Logged data only contains previously shown items
Feedback loops: Recommendations change user behavior
Best practice: Use offline metrics for fast iteration, validate with online A/B before full rollout.
19 Interview Questions
Q1: User-based vs item-based CF–when to use each?
A: Item-based when \(|\mathcal{I}| \ll |\mathcal{U}|\) (e-commerce): item similarities stable, precomputable offline, \(O(n^2)\) vs \(O(m^2)\). User-based for small user bases (enterprise B2B). Both fail on cold start and lack generalization.
Q2: Why ALS instead of SGD for matrix factorization?
A: Parallelization–with \(P\) fixed, all \(\mathbf{q}_i\) solved independently (critical for Spark). Handles confidence-weighted implicit feedback naturally. Closed-form (no LR tuning). SGD faster per-iteration but sequential.
Q3: How does DeepFM handle sparse features efficiently?
A: Embedding per categorical value. FM identity: \(\sum_{i<j}\langle\mathbf{v}_i,\mathbf{v}_j\rangle x_ix_j = \frac{1}{2}(\|\sum_i\mathbf{v}_ix_i\|^2 - \sum_i\|\mathbf{v}_ix_i\|^2)\) → \(O(kN)\) for \(N\) non-zero features.
Q4: Why two-tower instead of single model? Trade-offs?
A: Precompute item embeddings → ANN retrieval \(O(\log n)\) vs \(O(n)\). Trade-off: dot product limits interaction modeling. Solution: two-tower for retrieval (M→100s), re-rank with DeepFM/DLRM.
Q5: DLRM vs DeepFM–which scales better?
A: DLRM scales better: model parallelism (embedding tables across GPUs), data parallelism (MLPs), all-to-all communication. DeepFM simpler for single-GPU scale.
Q6: How to handle implicit feedback vs explicit?
A: Confidence-weighted: \(c_{ui} = 1 + \alpha \cdot \text{count}\). Negative sampling (1:4-10 ratio). BPR loss assumes observed \(>\) unobserved. Absence \(\neq\) dislike, just lack of exposure.
Q7: NDCG vs MAP–when to use each?
A: NDCG for graded relevance (1-5 stars), MAP for binary (relevant/not). Both penalize lower ranks. NDCG smoother decay (\(\log_2\)), more common in practice.
Q8: Design a real-time recommender for 100M users × 1M items.
A: Offline: train two-tower, export embeddings, build FAISS IndexIVFPQ, train DLRM. Online (\(<\)100ms): user encoding (5ms) → FAISS top-500 (10ms) → Redis feature hydration (20ms) → DLRM ranking (30ms) → business rules (5ms). Scale: user sharding, Redis caching, batched inference.
Q9: Cold start solutions for new streaming service?
A: Launch: popularity + onboarding (genre selection) + content features (metadata + pretrained embeddings). Growth: blend CF as interactions accumulate + Thompson sampling exploration. Mature: two-tower retrieval + sequence models (what user just watched).
Q10: Why do offline and online metrics diverge?
A: Distribution shift (test=past, deploy=future), selection bias (only previously shown items), feedback loops (recommendations change behavior). Use offline for fast iteration, A/B for validation.
19.1 Quick Reference
| Method | Pros | Cons | Use Case |
|---|---|---|---|
| Item-based CF | Stable, offline compute | No generalization, cold start | E-commerce |
| Matrix Factor. | Scalable, generalizes | Linear interactions only | Baseline CF |
| NCF | Non-linear interactions | Slower than MF | When MF plateaus |
| DeepFM | Auto feature crossing | More complex | CTR with many features |
| DLRM | Billion-scale | Complex distributed train | Ads/feed at scale |
| Two-tower | Fast ANN retrieval | Limited interactions | Candidate generation |
19.2 Key Formulas
MF: \(\hat{r}_{ui} = \mathbf{p}_u^T \mathbf{q}_i\) ALS: \(\mathbf{p}_u = (Q^TQ + \lambda I)^{-1}Q^T\mathbf{r}_u\) Two-tower: \(\mathcal{L} = -\log\frac{e^{\mathbf{u}^T\mathbf{v}_+}}{e^{\mathbf{u}^T\mathbf{v}_+} + \sum_j e^{\mathbf{u}^T\mathbf{v}_j}}\)
NDCG: \(\text{DCG@K} = \sum_{i=1}^K \frac{2^{\text{rel}_i}-1}{\log_2(i+1)}\) FM: \(\sum_{i<j}\langle\mathbf{v}_i,\mathbf{v}_j\rangle x_i x_j = \frac{1}{2}(\|\sum_i\mathbf{v}_i x_i\|^2 - \sum_i\|\mathbf{v}_i x_i\|^2)\)