9 Chapter 8: Embeddings for NLP

10 Foundational Concepts

10.1 What are Embeddings?

An embedding is a mapping from discrete objects (words, tokens, documents) to continuous vector spaces \(\mathbb{R}^d\), where \(d\) is the embedding dimension.

Goal: Capture semantic similarity via geometric proximity \[\text{sim}(v_{\text{king}}, v_{\text{queen}}) > \text{sim}(v_{\text{king}}, v_{\text{car}})\]

10.2 Why Embeddings Matter

Curse of dimensionality: One-hot vectors \(\in \{0,1\}^V\) where \(V\) = vocabulary size (50k+) \(\rightarrow\) all pairs orthogonal, no similarity structure
Generalization: Similar words share similar representations \(\rightarrow\) enables transfer learning
Compositionality: Vector arithmetic captures semantic relationships

Note

Core Insight: Dense low-dimensional representations (\(d = 100\text{--}1000\)) can capture semantic structure that sparse high-dimensional one-hot vectors (\(d = |V| \approx 50\text{k}\)) cannot.

11 Classical Approaches: Pre-Deep Learning

11.1 TF-IDF: Term Frequency-Inverse Document Frequency

Notation:

\(N\) = total number of documents in corpus
\(w\) = word (term)
\(d\) = specific document
\(\text{TF}(w, d)\) = frequency of word \(w\) in document \(d\)
\(\text{DF}(w)\) = number of documents containing word \(w\)

Formula: \[\text{TF-IDF}(w, d) = \text{TF}(w, d) \times \log \frac{N}{\text{DF}(w)}\]

Intuition:

TF (Term Frequency): Frequent in document \(\rightarrow\) important to that document
IDF (Inverse Document Frequency): Rare across corpus \(\rightarrow\) discriminative (penalizes common words like “the”, “and”)

Document representation: \(v_d = [\text{TF-IDF}(w_1, d), \ldots, \text{TF-IDF}(w_V, d)]\)

Example

TF-IDF Calculation:

Corpus: 1000 documents

Word “neural”:

Appears 10 times in document \(d\)
Appears in 20 documents total
\(\text{TF-IDF}(\text{``neural''}, d) = 10 \times \log(1000/20) = 10 \times 1.7 = 17\)

Word “the”:

Appears 50 times in document \(d\)
Appears in 995 documents total
\(\text{TF-IDF}(\text{``the''}, d) = 50 \times \log(1000/995) = 50 \times 0.005 = 0.25\)

Result: “neural” has higher weight despite lower frequency.

Limitations:

High-dimensional and sparse (dimension = vocabulary size)
No semantic similarity (“king” and “queen” are orthogonal)
Bag-of-words: ignores word order and context

11.2 Dimensionality Reduction: LSA and PCA

11.2.1 Latent Semantic Analysis (LSA)

Goal: Reduce term-document matrix \(X \in \mathbb{R}^{V \times D}\) from \(V\) dimensions to \(k\) dimensions (\(k \ll V\))

Method: Singular Value Decomposition (SVD) \[X = U \Sigma V^T\] where:

\(U \in \mathbb{R}^{V \times V}\): left singular vectors (word space)
\(\Sigma \in \mathbb{R}^{V \times D}\): diagonal matrix of singular values
\(V \in \mathbb{R}^{D \times D}\): right singular vectors (document space)

Low-rank approximation: Keep top \(k\) components \[X_k = U_k \Sigma_k V_k^T\]

Word embedding: Row \(i\) of \(U_k\) (or \(U_k \Sigma_k\)) gives \(k\)-dimensional representation of word \(i\)

11.2.2 Connection to PCA

Principal Component Analysis (PCA): Find directions of maximum variance via eigendecomposition

\[\text{Cov}(X) = \frac{1}{n}X^T X = V \Lambda V^T\] where \(\Lambda\) = eigenvalues, \(V\) = eigenvectors (principal components)

SVD relationship: \[X^T X = (U \Sigma V^T)^T (U \Sigma V^T) = V \Sigma^T U^T U \Sigma V^T = V \Sigma^2 V^T\]

Thus: PCA eigenvectors = right singular vectors of SVD, and \(\Lambda = \Sigma^2\)

Note

Key Insight: LSA applies SVD to term-document matrix; PCA applies eigendecomposition to covariance matrix. They are mathematically equivalent when centered data is used.

Why LSA/PCA Work:

Captures latent “topics” (e.g., dimension 1 = “sports”, dimension 2 = “politics”)
Reduces noise from rare co-occurrences
Dense representation: 100k \(\rightarrow\) 300 dimensions

Limitations:

Linear model: Cannot capture complex nonlinear semantics
No context: Polysemy problem (“bank” river vs. finance gets same vector)
Batch method: Must recompute for new data (no incremental updates)

11.3 Other Dimensionality Reduction Methods

Multidimensional Scaling (MDS): Preserves pairwise distances in lower dimensions

Given distance matrix \(D\), find \(X \in \mathbb{R}^{n \times k}\) such that \(\|x_i - x_j\| \approx D_{ij}\)
Classical MDS = PCA when distances are Euclidean
Use in NLP: Visualization (2D/3D projections), not for learning embeddings

Isomap (Isometric Mapping): Nonlinear via geodesic distances

Builds neighborhood graph, computes shortest paths
Preserves intrinsic manifold geometry
Use in NLP: Rarely used (computationally expensive, requires dense similarity matrix)

t-SNE: Probabilistic nonlinear dimensionality reduction

Preserves local structure (nearby points stay nearby)
Heavy-tailed Student-t distribution in low-dim space
Use in NLP: Visualization only (2D/3D scatter plots of Word2Vec/BERT embeddings)
Not for embeddings: Loses global structure, non-deterministic

Note

Modern Insight: Neural embeddings (Word2Vec, BERT) implicitly perform nonlinear dimensionality reduction via gradient descent–far more powerful than linear methods (LSA/PCA).

12 Neural Embeddings: Word2Vec Era

12.1 Word2Vec (2013): The Breakthrough

Core Idea: Learn embeddings by predicting context from words (or vice versa)

12.1.1 Skip-gram Architecture

Task: Given center word \(w_c\), predict context word \(w_o\) within window

Model: \[P(w_o | w_c) = \frac{\exp(u_{w_o}^T v_{w_c})}{\sum_{w \in V} \exp(u_w^T v_{w_c})}\]

Notation:

\(v_{w_c} \in \mathbb{R}^d\): center (input) embedding of word \(w_c\)
\(u_{w_o} \in \mathbb{R}^d\): context (output) embedding of word \(w_o\)
\(V\): vocabulary (size \(|V|\) = 50k–100k)
Denominator: softmax normalization over entire vocabulary

Training Objective: Maximize log-likelihood \[\mathcal{L} = \sum_{t=1}^T \sum_{-c \leq j \leq c, j \neq 0} \log P(w_{t+j} | w_t)\] where \(c\) = context window size (e.g., 5)

Example

Skip-gram Training Example:

Sentence: “I love natural language processing”

Window size \(c = 2\)

Center word: “natural”

Context words (within window): “love”, “language”

Objective: maximize \[\log P(\text{``love''} | \text{``natural''}) + \log P(\text{``language''} | \text{``natural''})\]

Problem: Full Softmax is Intractable

Computing \(\sum_{w \in V} \exp(u_w^T v_{w_c})\) requires iterating over 50k+ words
Gradient computation: \(O(|V|)\) per training example \(\rightarrow\) too slow!

12.1.2 Negative Sampling: Efficient Training

Key Innovation: Replace expensive softmax with binary classification

Intuition:

Instead of: “predict correct context word among 50k candidates”
Do: “distinguish actual context word (positive) from random words (negatives)”

Objective: For each (center, context) pair, sample \(k\) negative words \[\mathcal{L}_{\text{NS}} = \log \sigma(u_{w_{\text{pos}}}^T v_c) + \sum_{i=1}^k \mathbb{E}_{w_i \sim P_n(w)}[\log \sigma(-u_{w_i}^T v_c)]\]

Notation breakdown:

\(w_{\text{pos}}\): actual context word (positive example)
\(w_i\): negative example word sampled from noise distribution \(P_n(w)\)
\(k\): number of negative samples (typically 5–20)
\(\sigma(x) = \frac{1}{1+e^{-x}}\): sigmoid function
Goal: Maximize \(\sigma(u_{\text{pos}}^T v_c)\) (push positive pair score \(\rightarrow\) 1) and maximize \(\sigma(-u_{\text{neg}}^T v_c)\) (push negative pair score \(\rightarrow\) 0)

Negative Sampling Distribution: \[P_n(w) = \frac{f(w)^{3/4}}{\sum_{w'} f(w')^{3/4}}\] where \(f(w)\) = frequency of word \(w\) in corpus

Why \(3/4\) power?

Unigram: \(f(w) / \sum f\) \(\rightarrow\) over-samples frequent words (“the”, “is”)
Uniform: \(1/|V|\) \(\rightarrow\) under-samples frequent words
\(f(w)^{3/4}\): compromise that down-weights very frequent, up-weights rare

Example

Negative Sampling Example:

Positive pair: (“natural”, “language”)

\(u_{\text{language}}^T v_{\text{natural}} = 2.5\) (high score, good)

Negative samples (\(k=3\)): sample “car”, “banana”, “Tuesday” from \(P_n\)

\(u_{\text{car}}^T v_{\text{natural}} = -1.2\) (low score, good)
\(u_{\text{banana}}^T v_{\text{natural}} = 0.3\) (slightly positive, should be lower)
\(u_{\text{Tuesday}}^T v_{\text{natural}} = -0.8\) (low score, good)

Loss pushes “banana” score down via gradient update.

Gradient Computation: \[\begin{align*} \nabla_{v_c} \log \sigma(u_{\text{pos}}^T v_c) & = (1 - \sigma(u_{\text{pos}}^T v_c)) u_{\text{pos}} \\ \nabla_{v_c} \log \sigma(-u_{\text{neg}}^T v_c) & = -\sigma(u_{\text{neg}}^T v_c) u_{\text{neg}} \end{align*}\]

Complexity: \(O(k)\) instead of \(O(|V|)\) \(\rightarrow\) massive speedup!

Two Embedding Matrices:

Input embeddings \(V \in \mathbb{R}^{|V| \times d}\): center word vectors
Output embeddings \(U \in \mathbb{R}^{|V| \times d}\): context word vectors
After training: use \(V\) alone, or average \((V + U)/2\)

12.2 Word2Vec Properties

1. Semantic Arithmetic \[v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}\]

Captures analogies: “man is to king as woman is to ___?”
Find: \(\operatorname*{arg\,max}_{w} \text{sim}(v_w, v_{\text{king}} - v_{\text{man}} + v_{\text{woman}})\)

2. Cosine Similarity Captures Semantic Relatedness \[\text{sim}(w_1, w_2) = \frac{v_{w_1}^T v_{w_2}}{\|v_{w_1}\| \|v_{w_2}\|}\]

3. Connection to Matrix Factorization

Key result (Levy & Goldberg 2014): Skip-gram with negative sampling implicitly factorizes a PMI (Pointwise Mutual Information) matrix

PMI definition: \[\text{PMI}(w, c) = \log \frac{P(w, c)}{P(w)P(c)} = \log \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)}\] where \(\#(w, c)\) = count of \((w, c)\) pairs, \(|D|\) = total pairs

Word2Vec approximation: \[u_w^T v_c \approx \text{PMI}(w, c) - \log k\] where \(k\) = number of negative samples

Note

Interpretation: Word2Vec learns embeddings such that dot products approximate (shifted) log co-occurrence probabilities. This connects neural methods to classical count-based approaches.

12.3 CBOW and Other Variants

CBOW (Continuous Bag of Words): Predict center from context \[P(w_c | \text{context}) = \frac{\exp(v_{w_c}^T \bar{u}_{\text{context}})}{\sum_{w \in V} \exp(v_w^T \bar{u}_{\text{context}})}\] where \(\bar{u}_{\text{context}} = \frac{1}{2c}\sum_{j} u_{w_{t+j}}\) (average of context embeddings)

Skip-gram vs. CBOW:

Skip-gram: better for rare words (each occurrence is separate example)
CBOW: faster (one prediction per center word)
Skip-gram generally performs better, more commonly used

12.4 GloVe (2014): Global Vectors

Key Idea: Directly factorize co-occurrence matrix

Objective: \[J = \sum_{i,j=1}^V f(X_{ij})\left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2\] where \(X_{ij}\) = co-occurrence count, \(f\) = weighting function

GloVe vs. Word2Vec:

GloVe: global corpus statistics (entire co-occurrence matrix)
Word2Vec: local context windows (stochastic SGD)
Empirically similar performance; GloVe often faster to converge

12.5 FastText (2017): Subword Embeddings

Relation to Word2Vec: FastText extends Word2Vec’s skip-gram architecture by representing words as bag of character n-grams instead of atomic tokens. Same training objective (negative sampling), but embeddings are compositional.

Innovation: Words as bag of character n-grams

Example: “unbelievable” with \(n=3\):

N-grams: <un, unb, nbe, bel, eli, lie, iev, eva, vab, abl, ble>
Plus full word: <unbelievable>

Embedding: \[v_{\text{unbelievable}} = \sum_{g \in \text{n-grams}(\text{``unbelievable''})} v_g\]

Training: Use Word2Vec skip-gram loss, but predict context from sum of n-gram embeddings instead of single word embedding.

Benefits over Word2Vec:

Morphology: “believable”, “unbelievable”, “believe” share n-grams → similar vectors
OOV: New word “believableness” \(\rightarrow\) compose from seen n-grams (Word2Vec: UNK token)
Rare words: Better representations via shared substructure
Typos: “beleivable” still close to “believable” (Word2Vec: fails)
Multilingual: Works well for morphologically rich languages (German, Turkish, Finnish)

Tradeoff: Larger memory footprint (store n-gram embeddings), slightly slower training.

13 Contextual Embeddings: Transformer Era

13.1 The Context Problem

Static embeddings (Word2Vec, GloVe): One vector per word type

Example

Polysemy Problem:

Sentence 1: “I went to the bank to deposit money”

Sentence 2: “I sat by the river bank”

Problem: Same vector \(v_{\text{bank}}\) for both meanings (financial institution vs. river edge)

Solution: Contextualized embeddings – different vector for each token occurrence

13.2 ELMo (2018): Bi-LSTM Contextualization

Architecture:

Character-level CNN \(\rightarrow\) initial word representations
Forward LSTM: \(\vec{h}_t = \text{LSTM}(\vec{h}_{t-1}, x_t)\)
Backward LSTM: \(\overleftarrow{h}_t = \text{LSTM}(\overleftarrow{h}_{t+1}, x_t)\)
Concatenate: \(h_t = [\vec{h}_t; \overleftarrow{h}_t]\)

Usage: Pre-train on language modeling, then concatenate to task model

Limitation: Sequential computation (LSTMs don’t parallelize)

13.3 BERT (2019): Bidirectional Transformer Embeddings

Architecture: Transformer encoder with self-attention

BERT embedding composition: token + position + segment embeddings

13.3.1 Input Representation

BERT combines three types of embeddings: \[\text{Input}_i = E_{\text{token}}(w_i) + E_{\text{position}}(i) + E_{\text{segment}}(s)\]

1. Token Embeddings

Learned lookup table: \(E_{\text{token}} \in \mathbb{R}^{V \times d}\)
Vocabulary size \(V\) (e.g., 30k), hidden dim \(d\) (768 for BERT-base)

2. Position Embeddings

Learned (not sinusoidal like original Transformer)
\(E_{\text{pos}} \in \mathbb{R}^{L \times d}\) where \(L\) = max sequence length (512)

3. Segment Embeddings

Distinguishes sentence A vs. B (ie., position relative to [SEP] as in [CLS] A [SEP] B [SEP])
\(E_{\text{seg}} \in \mathbb{R}^{2 \times d}\) (only 2 embeddings)

Special Tokens:

[CLS]: Classification token (prepended to sequence)
[SEP]: Separator between sentences
[MASK]: Masked token for MLM pre-training

13.3.2 Pre-training: Masked Language Modeling

Procedure:

Randomly select 15% of tokens
Of selected tokens: 80% \(\rightarrow\) [MASK], 10% \(\rightarrow\) random word, 10% \(\rightarrow\) unchanged
Predict original token

Loss: \[\mathcal{L}_{\text{MLM}} = -\sum_{i \in \text{selected}} \log P(w_i | w_1, \ldots, [MASK], \ldots, w_n)\]

Why 80/10/10 split?

100% [MASK]: train/test mismatch (never see [MASK] during fine-tuning)
10% random: forces model to use context (can’t just memorize)
10% unchanged: keeps representation of real words

Note

Dynamic Masking: Selection is stochastic and recomputed for each training batch. The same sentence “I love NLP” will have different tokens selected in different epochs:

Epoch 1: mask “love” \(\rightarrow\) “I [MASK] NLP”
Epoch 2: mask “I” \(\rightarrow\) “[MASK] love NLP”
Epoch 3: leave “NLP” unchanged (selected but 10% rule) \(\rightarrow\) “I love NLP”

This prevents overfitting and ensures every token is trained in multiple contexts. Fixed masking would let the model memorize mask positions instead of learning bidirectional representations.

13.3.3 The [CLS] Token: Sentence-Level Representation

The [CLS] token is prepended to every input sequence and serves as an aggregate representation of the entire sequence.

Why [CLS] is important:

Design purpose: Unlike other tokens tied to specific words, [CLS] has no lexical meaning–it exists solely to aggregate information from all other tokens via self-attention
Bidirectional context: Through 12 layers of self-attention, \(h_{\text{[CLS]}}\) attends to all tokens in both directions, capturing sentence-level semantics
Classification head: BERT’s pre-training includes Next Sentence Prediction (NSP) trained on \(h_{\text{[CLS]}}\), making it optimized for sequence-level tasks
Fine-tuning: For downstream classification tasks (sentiment, NLI, QA), add a linear layer on top of \(h_{\text{[CLS]}}\) and fine-tune end-to-end

Note

Key Distinction: While token embeddings \(h_i\) represent individual words in context, \(h_{\text{[CLS]}}\) represents the entire sequence. This makes BERT naturally suited for both token-level tasks (NER, POS tagging) and sequence-level tasks (classification, retrieval).

13.3.4 BERT Output: Extracting Embeddings

Forward pass: \[\begin{align*} & \text{Input: [CLS] I love natural language [SEP]} \\ & \quad \downarrow \\ & \text{Token + Position + Segment embeddings} \\ & \quad \downarrow \\ & \text{12 layers of Self-attention + FFN} \\ & \quad \downarrow \\ & \text{Output: } h_{\text{[CLS]}}, h_{\text{I}}, h_{\text{love}}, h_{\text{natural}}, h_{\text{language}}, h_{\text{[SEP]}} \end{align*}\]

Extracting embeddings:

Sentence-level: Use \(h_{\text{[CLS]}}\) (designed for classification)
Token-level: Use \(h_i\) for token \(i\)
Mean pooling: \(\frac{1}{n}\sum_{i=1}^n h_i\) (average all non-special tokens)

13.4 Modern LLMs: LLaMA, GPT-4

Architecture: Decoder-only transformers (autoregressive LM)

Key Differences from BERT:

Causal attention: Token \(i\) only sees tokens \(1, \ldots, i\) (not future)
Training objective: Next-token prediction (generation), not masked LM
No [CLS] token: No dedicated sequence-level representation
RoPE: Rotary position embeddings (not learned)

Why LLMs are suboptimal for embeddings:

Causal attention limits context (only left-to-right)
Trained for generation, not semantic similarity
Must use ad-hoc pooling (last token, mean-pooling)

14 Dedicated Embedding Models

14.1 Sentence-BERT (SBERT, 2019)

Problem: BERT requires \(O(n^2)\) comparisons

Comparing 10k sentences: 50M BERT forward passes (intractable)

Solution: Siamese network with contrastive learning

Architecture:

Encode sentence A: \(u = \text{Pool}(\text{BERT}(A))\) (e.g., mean-pool or [CLS] token)
Encode sentence B: \(v = \text{Pool}(\text{BERT}(B))\)
Compare: \(\text{sim}(u, v) = \frac{u^T v}{\|u\| \|v\|}\)

What Gets Trained:

BERT’s weights are fine-tuned (not frozen!)
Pooling has no parameters (just extracts [CLS] or averages token embeddings)
Loss compares sentence embeddings to update BERT such that similar sentences → similar vectors
Base BERT (trained on MLM) doesn’t optimize for sentence similarity; SBERT does

Training Loss (Triplet Loss): \[\mathcal{L} = \max(0, \|u_a - u_p\|^2 - \|u_a - u_n\|^2 + \epsilon)\] where \(u_a\) = anchor, \(u_p\) = positive (similar sentence), \(u_n\) = negative (dissimilar), \(\epsilon\) = margin

Alternative: Contrastive Loss (InfoNCE): \[\mathcal{L} = -\log \frac{\exp(\text{sim}(u_a, u_p) / \tau)}{\exp(\text{sim}(u_a, u_p) / \tau) + \sum_{i=1}^N \exp(\text{sim}(u_a, u_{n_i}) / \tau)}\] Push positive pairs together, negatives apart in embedding space.

Key Insight: Fine-tuning BERT with similarity-based loss (triplet/contrastive) makes embeddings optimized for cosine similarity, unlike base BERT trained on MLM.

Result: Embed once, compare via cosine similarity

Scales to millions of sentences (FAISS, Pinecone)
1000× faster than pairwise BERT comparisons

14.2 Modern Embedding Models (2023+)

Open-source: bge-large, e5-mistral-7b, gte-large, nomic-embed-text

Commercial: OpenAI text-embedding-3, Cohere embed-v3, Voyage AI

Training Improvements:

Hard negative mining: BM25 top-k as hard negatives
In-batch negatives: 511 negatives per example (batch size 512)
Multi-task training: Retrieval, QA, semantic similarity, classification
Matryoshka embeddings: Single model, multiple dimensions

Matryoshka Representation Learning:

\(u \in \mathbb{R}^{1024}\) \(\rightarrow\) can truncate to 768, 512, 256, 128 dims
Training: nested loss at each dimensionality \[\mathcal{L} = \sum_{d \in \{128, 256, 512, 1024\}} \mathcal{L}_d(u[:d])\]
Use case: 128-dim for fast search, 1024-dim for reranking

14.3 ColBERT: Late Interaction

Problem: Mean-pooling loses token-level information

Solution: Keep token embeddings, compute MaxSim score at query time

Architecture:

Query tokens: \(Q = \text{BERT}(q) \in \mathbb{R}^{|q| \times d}\) (e.g., 5 tokens \(\times\) 128 dim)
Document tokens: \(D = \text{BERT}(d) \in \mathbb{R}^{|d| \times d}\) (e.g., 100 tokens \(\times\) 128 dim)
Score: \(\sum_{i=1}^{|q|} \max_{j=1}^{|d|} (Q_i \cdot D_j)\)

Intuition: For each query token, find most similar document token; sum over all query tokens. This captures fine-grained keyword matching while preserving semantic understanding.

Production Implementation:

Offline indexing:
- Encode all documents: \(D_i = \text{BERT}(d_i)\) for \(i=1,\ldots,N\)
- Store token embeddings in compressed index (e.g., 100 tokens/doc \(\times\) 128 dim)
- Build approximate nearest neighbor index (FAISS IVF) for candidate retrieval
Online query:
- Encode query: \(Q = \text{BERT}(q)\) (fast, only 5-10 tokens typically)
- Candidate retrieval: Use ANN to get top-1000 documents (filter by first-stage similarity)
- MaxSim reranking: Compute exact MaxSim score for each of 1000 candidates: \[\text{score}(q, d_i) = \sum_{k=1}^{|q|} \max_{j=1}^{|d_i|} (Q_k \cdot D_{i,j})\]
- Return top-K by MaxSim score

Why ColBERT is Slower than SBERT:

Method	Storage per Doc	Query-time Computation
SBERT	1 vector (768 dims)	\(O(1)\) cosine similarity per doc
ColBERT	\(\sim\)100 vectors (128 dims each)	\(O(\|q\| \times \|d\|)\) MaxSim per doc
SBERT total	768 floats	768 dot products
ColBERT total	12,800 floats (16\(\times\) larger)	5 \(\times\) 100 = 500 dot products + max ops

Concrete Example:

Query: "python async await" (5 tokens)
Document: 100 tokens
SBERT: 1 cosine similarity (768 ops)
ColBERT: For each of 5 query tokens, compute dot product with 100 doc tokens, take max → 5 \(\times\) 100 = 500 dot products + 5 max operations
Even with precomputed doc tokens, MaxSim is \(\sim\)3-5\(\times\) slower per document
For 1M documents: SBERT can use ANN (sublinear); ColBERT needs two-stage (ANN candidates → MaxSim rerank)

When to use ColBERT:

Keyword-heavy retrieval: Code search, technical docs, legal documents
Multi-hop reasoning: Token-level alignment matters (e.g., "red hat" vs "hat" + "red" elsewhere)
Hybrid systems: SBERT for first-pass retrieval (top-10k), ColBERT for reranking (top-100)
Better accuracy: Typically 5-10% improvement over SBERT, at cost of 3-5\(\times\) latency

Optimization Techniques:

Compress doc embeddings to lower dimensions (64 or 32 dims)
Prune document tokens (keep only top-K most salient)
GPU batching for MaxSim computation
Cascade: SBERT (1M→10k) → ColBERT (10k→100) → cross-encoder (100→10)

14.4 Cross-Encoders: Maximum Accuracy

Architecture: Encode query + document together in single BERT pass

How it works: \[\begin{align*} & \text{Input: [CLS] query [SEP] document [SEP]} \\ & \quad \downarrow \\ & \text{BERT encoder (full cross-attention)} \\ & \quad \downarrow \\ & h_{\text{[CLS]}} \rightarrow \text{Classification head} \rightarrow \text{Relevance score} \end{align*}\]

Key difference from bi-encoders (SBERT):

Bi-encoder (SBERT): Encode query and document separately, compare embeddings
- score = cosine(embed(query), embed(doc))
- Pre-compute document embeddings, fast retrieval
Cross-encoder: Encode query + document jointly
- score = classifier(BERT([query; doc]))
- Full attention between query and document tokens
- Cannot pre-compute (must encode each pair)

Example

Cross-Encoder Example:

Query: “What is machine learning?”

Document 1: “Machine learning is a subset of AI that enables computers to learn from data.”

Cross-encoder forward pass:

    [CLS] What is machine learning? [SEP]
          Machine learning is a subset... [SEP]
                      ↓
        Self-attention (query tokens attend to doc tokens)
                      ↓
            h_[CLS] → classifier → score = 0.92 (relevant)

Document 2: “The weather today is sunny and warm.”

Cross-encoder forward pass:

    [CLS] What is machine learning? [SEP]
          The weather today... [SEP]
                      ↓
        Self-attention detects semantic mismatch
                      ↓
            h_[CLS] → classifier → score = 0.05 (not relevant)

Why cross-encoders are more accurate:

Token-level interaction: Query word “learning” directly attends to document word “learning”
Full context: Sees both query and document simultaneously
Richer features: 12+ layers of cross-attention

Why cross-encoders are slow:

Cannot pre-compute: each (query, doc) pair needs full BERT pass
Complexity: \(O(n)\) BERT calls for \(n\) documents
Retrieval: 1M documents \(\rightarrow\) hours (vs. seconds for dense retrieval)

Typical RAG pipeline:

Stage	Method	Candidates	Speed	Accuracy
1. Retrieval	Dense (SBERT)	1M \(\rightarrow\) 1000	Fast (ms)	Good
2. Rerank (optional)	ColBERT	1000 \(\rightarrow\) 100	Medium	Better
3. Final rerank	Cross-encoder	100 \(\rightarrow\) 10	Slow (1s)	Best

15 Connection to RAG

15.1 RAG Pipeline

1. Indexing Phase: \[\text{Documents} \rightarrow \text{Chunking} \rightarrow \text{Embedding} \rightarrow \text{Vector DB (FAISS/Pinecone)}\]

2. Retrieval Phase: \[\text{Query} \rightarrow \text{Embedding} \rightarrow \text{Similarity Search} \rightarrow \text{Top-K docs}\]

3. Generation Phase: \[\text{Top-K docs + Query} \rightarrow \text{LLM (GPT-4, LLaMA)} \rightarrow \text{Answer}\]

15.2 Embedding Models in RAG

Asymmetric search:

Query: short (5–20 tokens)
Document: long (128–512 tokens)
Need models trained for asymmetric retrieval (e5-large, bge-large)

Chunking strategies:

Fixed size: 256 tokens with 50-token overlap
Semantic: split by paragraphs, sentences
Hierarchical: summaries + full chunks

15.3 Multi-Stage Retrieval

Two-stage retrieval:

First-pass: Dense retrieval (SBERT) \(\rightarrow\) top-1000
Rerank: ColBERT \(\rightarrow\) top-100
Optional: Cross-encoder \(\rightarrow\) top-10

Trade-off:

Dense: fast, good recall
ColBERT: slower, better precision
Cross-encoder: slowest, best precision

16 Interview Cheat Phrases

TF-IDF to embeddings:

“TF-IDF represents documents as sparse vectors with no semantic similarity. LSA applies SVD to reduce dimensionality, capturing latent topics. Word2Vec replaced this linear factorization with neural networks trained via negative sampling, learning dense embeddings where cosine similarity captures semantics.”
Word2Vec negative sampling:

“Word2Vec skip-gram predicts context from center word. Negative sampling replaces expensive softmax with binary classification: maximize score of positive pairs, minimize score of k negative samples drawn from noise distribution. This is equivalent to implicit factorization of a shifted PMI matrix.”
BERT vs. Word2Vec:

“Word2Vec produces static embeddings–‘bank’ has one vector regardless of context. BERT uses self-attention to create contextualized embeddings where each token’s representation depends on surrounding tokens. This solves polysemy via bidirectional Transformers and MLM pre-training.”
[CLS] token importance:

“The [CLS] token aggregates information from all other tokens through self-attention layers, making it a natural sentence-level representation. BERT’s NSP pre-training optimizes [CLS] for sequence classification, which is why it’s used for embeddings and downstream tasks.”
BERT vs. LLaMA for embeddings:

“BERT’s bidirectional attention and [CLS] token make it naturally suited for embeddings. LLaMA is causal (decoder-only), trained for generation, not similarity. Dedicated embedding models like SBERT, trained with contrastive loss, outperform both for retrieval tasks.”
ColBERT:

“ColBERT keeps token-level embeddings and uses late interaction (MaxSim over all token pairs). This captures fine-grained matching like BM25 while maintaining semantic understanding. It’s used as a reranker in RAG pipelines for precision-critical retrieval.”
Cross-encoder vs. bi-encoder:

“Bi-encoders encode query and document separately, enabling fast retrieval via pre-computed embeddings. Cross-encoders encode them jointly with full cross-attention, achieving higher accuracy but requiring \(O(n)\) forward passes. Use bi-encoders for retrieval, cross-encoders for final reranking.”
Dimensionality reduction:

“PCA finds linear projections via SVD. Neural embeddings learn nonlinear manifolds via gradient descent. Word2Vec with negative sampling implicitly factorizes a PMI matrix but with far more expressive power than PCA.”

17 Summary: Key Takeaways

Evolution: Count-based (TF-IDF) \(\rightarrow\) Matrix factorization (LSA/PCA) \(\rightarrow\) Neural static (Word2Vec/GloVe) \(\rightarrow\) Contextualized (BERT) \(\rightarrow\) Task-optimized (SBERT, BGE)
Word2Vec foundation: Negative sampling made neural embeddings tractable. Understanding this loss function explains the entire embedding paradigm.
Context is critical: Static embeddings fail on polysemy; transformers solve this with self-attention and bidirectional context
Training objective matters: Contrastive loss (SBERT) \(>\) causal LM loss (GPT) for similarity tasks
RAG architecture: Dense retrieval (SBERT) for recall \(\rightarrow\) ColBERT for precision \(\rightarrow\) cross-encoder for final reranking
Modern best practice: Use dedicated embedding models (SBERT, E5, BGE) trained with hard negatives and multi-task objectives. Fine-tune on domain data for 10–20% improvement.
[CLS] token: BERT’s [CLS] token is specifically designed as an aggregate sequence representation through bidirectional self-attention, making BERT naturally suited for embedding tasks unlike causal LLMs.