Chapter 7: Tokenization for Large Language Models
9 Tokenization Fundamentals
9.1 What is Tokenization?
Tokenization is the process of converting raw text into a sequence of discrete units (tokens) that can be mapped to indices in a fixed vocabulary.
\[\text{Text: ``Hello world''} \xrightarrow{\text{tokenize}} [\text{``Hello''}, \text{``\_world''}] \xrightarrow{\text{encode}} [15496, 995]\]
Whitespace markers: Many modern subword tokenizers encode leading whitespace explicitly, e.g., SentencePiece uses a dedicated U+2581 whitespace marker and GPT-2 style BPE uses a dedicated leading-space marker (often rendered as a visible “space” character), instead of relying on a separate “split on spaces” pre-tokenizer.
Why tokenization matters:
Defines the granularity of language modeling (character, subword, word)
Determines vocabulary size and thus embedding matrix size
Affects sequence length and attention complexity
Impacts model’s ability to handle rare words, typos, multilingual text
Directly influences perplexity, KV cache size, inference latency
9.2 Tokenization Hierarchy
| Granularity | Vocab Size | Seq Length | Coverage |
|---|---|---|---|
| Character | Small (100–500) | Very long | Perfect |
| Subword (BPE/Unigram) | Medium (10k–100k) | Medium | High |
| Word | Large (100k–1M) | Short | Poor (OOV) |
| Byte | Minimal (256) | Very long | Perfect |
Core Tradeoff: Smaller vocabulary → longer sequences → more attention computation. Larger vocabulary → shorter sequences → bigger embedding matrices.
10 Byte Pair Encoding (BPE)
10.1 Algorithm
BPE is a greedy, bottom-up algorithm that iteratively merges the most frequent adjacent token pairs.
Initialize: Start with character-level vocabulary
Count: Find most frequent adjacent pair \((a, b)\) in corpus
Merge: Replace all occurrences of \((a, b)\) with new token \(ab\)
Repeat: Until vocabulary reaches target size
BPE Merge Example:
Corpus: low low low lower lowest
Iteration 1: Most frequent pair = (l, o) → merge to lo
Result: lo w lo w lo w lo wer lo west
Iteration 2: Most frequent pair = (lo, w) → merge to low
Result: low low low lower lowest
Final vocabulary (after further merge iterations): {l, o, w, e, r, s, t, lo, low, lower, lowest}
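The merge loop can be sketched in a few lines of Python (a toy illustration over the corpus above; real BPE implementations also record the merge list for use at encoding time):

```python
from collections import Counter

def pair_counts(corpus):
    # corpus maps a word (tuple of symbols) to its frequency
    counts = Counter()
    for word, freq in corpus.items():
        for pair in zip(word, word[1:]):
            counts[pair] += freq
    return counts

def apply_merge(corpus, pair):
    # Replace every adjacent occurrence of `pair` with the merged symbol
    a, b = pair
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(a + b)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# "low low low lower lowest", pre-split into characters
corpus = {tuple("low"): 3, tuple("lower"): 1, tuple("lowest"): 1}
merges = []
for _ in range(2):  # two merge iterations, as in the example
    counts = pair_counts(corpus)
    best = max(counts, key=counts.get)
    merges.append(best)
    corpus = apply_merge(corpus, best)

print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After these two merges the corpus reads low / low·e·r / low·e·s·t, matching iteration 2 in the example.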
10.2 Properties
Deterministic: Given corpus statistics, merges are fixed
Greedy: Locally optimal (maximizes frequency at each step)
No probabilistic model: Pure frequency-based heuristic
Compression-focused: Minimizes total token count on training corpus
BPE Limitation: Greedy merging can produce suboptimal segmentations for rare words. Example: ‘‘unprecedented’’ might segment as [‘‘un’’, ‘‘pre’’, ‘‘ced’’, ‘‘ented’’] instead of meaningful morphemes.
10.3 Byte-Level BPE (GPT-2 Style)
Byte-level BPE operates on UTF-8 bytes instead of Unicode characters:
Base vocabulary = 256 (all possible bytes)
No out-of-vocabulary (OOV) issues by construction
Handles any Unicode text, emojis, rare scripts
Used in GPT-2 and OpenAI’s tiktoken-style tokenizers (e.g., GPT-3.5/4 families)
Why Bytes?
Character ‘‘é’’ (U+00E9):
UTF-8 encoding: 0xC3 0xA9 (2 bytes)
Byte-level BPE sees: [195, 169]
Can be merged if frequent: [‘‘é’’]
This ensures universal coverage without maintaining explicit character sets.
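Python’s built-in UTF-8 codec shows exactly what a byte-level tokenizer sees for this character:

```python
# "é" (U+00E9) as the two UTF-8 bytes a byte-level BPE operates on
raw = "é".encode("utf-8")
print(list(raw))            # [195, 169]
print(raw.decode("utf-8"))  # é (round-trips losslessly)
```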
11 WordPiece (BERT-Style)
11.1 What it is
WordPiece is a subword tokenization scheme popularized by BERT. It learns a fixed vocabulary of subword units and uses a deterministic, greedy segmentation at inference time (often “longest-match-first”).
11.2 How it tokenizes
Words are split into subword pieces; continuation pieces are typically marked with a prefix like ##
If a word cannot be fully decomposed into known pieces, it maps to [UNK] (no universal coverage unless augmented with byte fallback)
WordPiece example (BERT convention):
‘‘unaffable’’ \(\rightarrow\) [‘‘un’’, ‘‘##aff’’, ‘‘##able’’]
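Longest-match-first segmentation can be sketched as follows (a minimal illustration over a hypothetical three-piece vocabulary; real WordPiece implementations also cap word length and handle normalization):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first segmentation with ## continuation marks
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation piece
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if match is None:
            return [unk]  # no full decomposition: fall back to [UNK]
        pieces.append(match)
        start = end
    return pieces

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("xyz", vocab))        # ['[UNK]']
```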
11.3 Relation to BPE
Both WordPiece and BPE produce subword vocabularies, but they differ in how merges/pieces are chosen during training (BPE: greedy frequency merges; WordPiece: pieces chosen to improve a likelihood objective) and in conventions like ## continuation markers and [UNK] fallback.
12 Unigram Language Model Tokenization
12.1 Probabilistic Formulation
Unlike BPE’s greedy frequency counting, Unigram LM models tokenization as a probabilistic process.
Given vocabulary \(V\) and text \(X\), find the segmentation \(\mathbf{x} = (x_1, \ldots, x_n)\) that maximizes: \[P(\mathbf{x}) = \prod_{i=1}^n P(x_i)\] where \(P(x_i)\) is the unigram probability of token \(x_i\) in \(V\).
12.2 Training Algorithm
Initialize: Start with large vocabulary (all substrings up to length \(k\))
EM Training:
E-step: For each word, find the maximum likelihood segmentation using Viterbi
M-step: Update token probabilities based on segmentation frequencies
Prune: Remove tokens that minimally affect likelihood (typically remove 10–20%)
Repeat: Until vocabulary reaches target size
Key Difference from BPE:
BPE: Greedy local merges (bottom-up)
Unigram LM: Global likelihood optimization (top-down pruning)
Unigram LM evaluates entire segmentations, not just adjacent pairs
12.3 Viterbi Decoding
To find the best segmentation, use dynamic programming:
\[\text{best}[i] = \max_{j < i} \left(\text{best}[j] + \log P(X[j:i])\right)\]
where \(X[j:i]\) is substring from position \(j\) to \(i\), and \(\text{best}[i]\) stores the maximum log-probability for segmenting \(X[0:i]\).
Algorithm:
Initialize: \(\text{best}[0] = 0\) (empty string has probability 1)
For each position \(i = 1\) to \(n\):
Try all substrings \(X[j:i]\) for \(j < i\) that exist in vocabulary \(V\)
Compute \(\text{best}[j] + \log P(X[j:i])\) for each valid \(j\)
Set \(\text{best}[i] = \max_j (\text{best}[j] + \log P(X[j:i]))\)
Backtrack to recover the segmentation
Complexity: \(O(n^2)\) vocabulary lookups per word (or \(O(n \cdot k)\) if the longest token has length \(k\)), where \(n\) is word length.
Viterbi Example: Segmenting “newer”
Vocabulary with log-probabilities (trained from corpus):
\(\log P(\texttt{``new''}) = -1.0\) (common word, high prob)
\(\log P(\texttt{``er''}) = -2.0\) (common suffix)
\(\log P(\texttt{``newer''}) = -3.5\) (less frequent as single token)
\(\log P(\texttt{``n''}) = -4.0\), \(\log P(\texttt{``e''}) = -4.0\), \(\log P(\texttt{``w''}) = -4.0\), \(\log P(\texttt{``r''}) = -4.0\) (single chars)
Dynamic Programming Table:
| \(i\) | \(\text{best}[i]\) | Best segmentation to position \(i\) |
|---|---|---|
| 0 | 0.0 | [] |
| 1 | -4.0 | [‘‘n’’] |
| 2 | -8.0 | [‘‘n’’, ‘‘e’’] |
| 3 | -1.0 | [‘‘new’’] \(\leftarrow\) better than [‘‘n’’, ‘‘e’’, ‘‘w’’] = -12.0 |
| 4 | -5.0 | [‘‘new’’, ‘‘e’’] |
| 5 | -3.0 | [‘‘new’’, ‘‘er’’] |
Key computation at \(i=3\):
Option 1: \(\text{best}[2] + \log P(\texttt{``w''}) = -8.0 + (-4.0) = -12.0\)
Option 2: \(\text{best}[0] + \log P(\texttt{``new''}) = 0.0 + (-1.0) = -1.0\) \(\leftarrow\) winner!
Key computation at \(i=5\) (end of word):
Option 1: \(\text{best}[3] + \log P(\texttt{``er''}) = -1.0 + (-2.0) = -3.0\) \(\leftarrow\) winner!
Option 2: \(\text{best}[0] + \log P(\texttt{``newer''}) = 0.0 + (-3.5) = -3.5\) (worse)
Final segmentation: [‘‘new’’, ‘‘er’’] with total log-prob = \(-3.0\)
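The worked example can be reproduced with a short Viterbi implementation (a toy sketch; production Unigram tokenizers add character fallback for out-of-vocabulary symbols):

```python
import math

def viterbi_segment(text, logp):
    # best[i]: max log-prob of segmenting text[:i]; back[i]: split point
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
                back[i] = j
    # Backtrack from the end to recover the winning segmentation
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1], best[n]

logp = {"new": -1.0, "er": -2.0, "newer": -3.5,
        "n": -4.0, "e": -4.0, "w": -4.0, "r": -4.0}
tokens, score = viterbi_segment("newer", logp)
print(tokens, score)  # ['new', 'er'] -3.0
```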
Why this is better than BPE: BPE would greedily merge based on frequency counts without considering the entire word’s likelihood. If ‘‘ne’’ and ‘‘ew’’ happened to be frequent pairs in training, BPE might produce [‘‘ne’’, ‘‘wer’’] or [‘‘new’’, ‘‘er’’] depending on merge order. Unigram LM evaluates all possibilities and picks the globally optimal segmentation under the probabilistic model.
13 High Vocabulary vs. Low Vocabulary
13.1 Tradeoff Table
| Dimension | High Vocab (50k–300k) | Low Vocab (200–5k) |
|---|---|---|
| Sequence length | Short (fewer tokens) | Long (4–5× more tokens) |
| Embedding size | Large (30–40% of params) | Very small |
| Training compute | Less per sequence (shorter attention) | More per sequence (longer attention) |
| Generalization | Struggles with typos, rare words | Extremely robust |
| Multilingual | Complex rules, uneven coverage | Uniform across languages |
| Code modeling | Poor without special tokens | Very robust to identifiers |
| Compression | Good | Poor |
| Hardware deployment | Heavy weights (GB scale) | Lightweight (MB scale) |
13.2 Mathematical Perspective
Total compute cost: \[\text{Cost} = \underbrace{n^2 d}_{\text{attention}} + \underbrace{n d V}_{\text{output projection}} + \underbrace{V d}_{\text{embedding params}}\]
where \(n\) = sequence length, \(d\) = hidden dimension, \(V\) = vocabulary size.
What each term represents:
\(n^2 d\): Attention mechanism – computing \(QK^T\) (shape \(n \times n\)) and applying to \(V\), plus Q/K/V projections
\(n d V\): Output projection – mapping final hidden states (\(n \times d\)) to vocabulary logits (\(n \times V\)) via weight matrix (\(d \times V\))
\(V d\): Embedding matrix storage – parameter count for input/output embeddings (often tied)
Concrete example: for an illustrative GPT-3-width model (50k vocab, 12k context, 12k hidden dim):
Attention: \(n^2 d = (12k)^2 \times 12k = 1.7 \times 10^{12}\) FLOPs per layer
Output: \(n d V = 12k \times 12k \times 50k = 7.2 \times 10^{12}\) FLOPs
Embeddings: \(V d = 50k \times 12k = 600\)M parameters
At this scale the output projection costs more than any single layer’s attention, although attention summed over all layers still dominates total compute.
High vocab: Small \(n\), large \(V\) → output projection/embeddings dominate
Low vocab: Large \(n\), small \(V\) → attention dominates
Sweet spot depends on hardware (memory vs compute bound)
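A quick sanity check of the three terms for the numbers above:

```python
n, d, V = 12_000, 12_000, 50_000   # context length, hidden dim, vocab size

attention_per_layer = n * n * d    # QK^T plus applying attention, per layer
output_projection = n * d * V      # final hidden states to vocabulary logits
embedding_params = V * d           # embedding matrix parameter count

print(f"{attention_per_layer:.1e}")  # 1.7e+12
print(f"{output_projection:.1e}")    # 7.2e+12
print(f"{embedding_params:.1e}")     # 6.0e+08
```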
Which is better?
Compute-bound (data center GPUs): High vocab → shorter sequences
Memory-bound (edge devices, mobile): Low vocab → smaller models
Multilingual/code-heavy: Low vocab → better generalization
14 Character and Byte-Level Models
14.1 Why Character/Byte-Level?
Several modern architectures (Charformer, RWKV, ByT5, MEGABYTE) explore character- or byte-level tokenization.
Advantages:
Eliminates OOV issues: Can represent any text by construction
Smaller embedding matrix: Vocab = 256 bytes vs. 50k subwords
Robust to noise: Typos, OCR errors, social media text
Consistent multilingual behavior: Uniform granularity across languages
Enables ultra-long contexts: Combines with efficient attention (RWKV, Hyena)
Simpler deployment: No external tokenizer runtime
14.2 The Sequence Length Problem
Challenge: Byte-level encoding increases sequence length by roughly 4–5×.
For text: ‘‘The quick brown fox’’
Subword (BPE): 5 tokens
Byte-level: 19 bytes
Attention complexity: \(O(n^2 d)\) → roughly \((19/5)^2 \approx 14\times\) more compute!
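The blowup for the example string, assuming the 5-token BPE segmentation quoted above:

```python
text = "The quick brown fox"
n_bytes = len(text.encode("utf-8"))  # every ASCII character is one byte
n_subword = 5                        # illustrative BPE token count from the text

print(n_bytes)                               # 19
print(round((n_bytes / n_subword) ** 2, 1))  # ~14.4x quadratic attention cost
```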
14.3 Solutions: Efficient Attention
Local convolutions (Charformer): Downsample bytes → subword-like chunks internally
Linear attention (RWKV): Replace softmax attention with recurrent formulation
Hyena/Monarch: State-space models with sub-quadratic complexity
Delta attention (Kimi): Gated recurrent updates for long-context
Charformer Architecture:
\[\text{Bytes} \xrightarrow{\text{Conv layers}} \text{Subword-like features} \xrightarrow{\text{Transformer}} \text{Output}\]
Input: 256-dim byte embeddings
Local CNN: Group adjacent bytes (e.g., 4-byte windows)
Output: Compressed sequence (\(\sim\)subword length)
Then apply standard Transformer
Result: Byte-level robustness with subword-level efficiency!
Modern Trend: Bytes + efficient attention → best of both worlds.
“Bytes give universality and robustness; subwords give compression. Modern architectures try to compensate for byte-level sequence length with efficient attention.”
15 Perplexity and Tokenization
15.1 The Perplexity Trap
\[\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^N \log P(t_i)\right)\] where \(N\) is the number of tokens, \(t_i\) are predicted tokens.
Critical issue: Perplexity is not comparable across different tokenizers.
Example: Comparing Two Models
Text: ‘‘The cat sat on the mat’’
Model A (BPE, 50k vocab):
Tokens: [‘‘The’’, ‘‘_cat’’, ‘‘_sat’’, ‘‘_on’’, ‘‘_the’’, ‘‘_mat’’], \(N = 6\) tokens
Perplexity = 12.3
Model B (byte-level, 256 vocab):
Tokens: [T, h, e, _, c, a, t, …], \(N = 22\) bytes
Perplexity = 45.7
Model B appears worse, but it’s predicting at a finer granularity!
15.2 Fair Comparison Metrics
Tokenization-invariant metrics:
Bits per character (BPC): \[\text{BPC} = -\frac{1}{C} \sum_{i=1}^N \log_2 P(t_i)\] where \(C\) = number of characters, \(N\) = number of tokens, and we sum over tokens \(t_i\).
Bits per byte (BPB): Same as BPC but using bytes.
Token-normalized cross-entropy: Use same tokenizer for both models.
Direct task evaluation: MMLU, HELM, HumanEval, RAG benchmarks.
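Converting both models’ perplexities to bits per character puts them on a common scale (using the illustrative numbers from the example above):

```python
import math

def bits_per_char(ppl, n_tokens, n_chars):
    # Total bits for the text = n_tokens * log2(ppl); normalize by characters
    return n_tokens * math.log2(ppl) / n_chars

chars = len("The cat sat on the mat")           # 22 characters
print(round(bits_per_char(12.3, 6, chars), 2))  # Model A (BPE): 0.99 BPC
print(round(bits_per_char(45.7, 22, chars), 2)) # Model B (byte): 5.51 BPC
```

Once normalized this way, the two numbers measure the same quantity: total bits assigned to the same character stream.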
Interview Soundbite:
“Perplexity is only meaningful when tokenization is identical. Tokenizer entropy dominates PPL. When tokenization changes, perplexity no longer measures the same distribution.”
16 Tokenization and KV Cache
16.1 KV Cache Basics
During autoregressive decoding, Transformers cache key-value pairs for all previous tokens:
\[\text{Memory}_{\text{KV}} = 2 \cdot L \cdot d \cdot n \cdot b\]
where:
\(L\) = number of layers
\(d\) = hidden dimension
\(n\) = sequence length (number of tokens)
\(b\) = bytes per parameter (2 for FP16, 4 for FP32)
Factor of 2: keys + values
16.2 Tokenization Impact
Key insight: Tokenization determines \(n\) (sequence length), which linearly affects KV cache size.
KV Cache Calculation:
Model: 40 layers, 8192 hidden dim, FP16 (2 bytes)
Text: 10,000 characters
BPE tokenizer (50k vocab):
Sequence length: 2,500 tokens
KV memory: \(2 \cdot 40 \cdot 8192 \cdot 2500 \cdot 2 \approx 3.3\) GB
Byte-level tokenizer (256 vocab):
Sequence length: 10,000 bytes (ASCII text: 1 byte per character)
KV memory: \(2 \cdot 40 \cdot 8192 \cdot 10000 \cdot 2 \approx 13.1\) GB
Byte-level uses 4× more memory for KV cache due to longer sequences!
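The same calculation in code (decimal gigabytes):

```python
def kv_cache_bytes(n_layers, d_model, n_tokens, bytes_per_value=2):
    # Factor of 2 for keys + values; bytes_per_value=2 for FP16
    return 2 * n_layers * d_model * n_tokens * bytes_per_value

bpe = kv_cache_bytes(40, 8192, 2_500)
byte_level = kv_cache_bytes(40, 8192, 10_000)

print(bpe / 1e9)         # 3.2768  (GB)
print(byte_level / 1e9)  # 13.1072 (GB)
print(byte_level / bpe)  # 4.0
```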
16.3 Consequences for Inference
Latency: More tokens → more decoding steps → higher latency
Memory pressure: Hits GPU memory limits faster
Batch size: Reduces max batch size for batched inference
Sliding window: Hits context window limit sooner
Speculative decoding: Affects branching factor and draft model granularity
Interview Soundbite:
“Tokenization determines effective sequence length, and KV cache scales linearly with token count. Low-vocab tokenizers create longer sequences, increasing cache pressure, latency, GPU memory use, and impacting long-context attention strategies.”
17 Domain-Specific Tokenization
17.1 Multilingual Data
Challenges:
Multiple scripts (Latin, CJK, Arabic, Indic, Cyrillic)
Whitespace-agnostic languages (Thai, Lao, Khmer)
Rare Unicode characters, emojis
Best practices:
SentencePiece Unigram LM: Language-agnostic, no whitespace assumptions
Byte-level BPE: Universal coverage
Shared vocabulary: Train on multilingual corpus (mBERT, XLM-R, mT5)
Explicit space marker: SentencePiece commonly uses U+2581 (LOWER ONE EIGHTH BLOCK) as a leading whitespace marker
Larger vocabulary: 100k–250k to balance compression across languages
Multilingual Segmentation:
English: ‘‘Hello world’’ \(\rightarrow\) [‘‘Hello’’, ‘‘_world’’]
Chinese: ‘‘你好世界’’ (nǐhǎo shìjiè) \(\rightarrow\) [‘‘你’’, ‘‘好’’, ‘‘世’’, ‘‘界’’]
Thai (no spaces): ‘‘สวัสดีโลก’’ (sawasdee lok) \(\rightarrow\) [‘‘สวัสดี’’, ‘‘โลก’’]
SentencePiece handles all three uniformly without language-specific rules.
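The whitespace marker convention can be illustrated without the library itself (a sketch; real SentencePiece applies this, by default, as part of its internal normalization):

```python
MARKER = "\u2581"  # ▁ LOWER ONE EIGHTH BLOCK, SentencePiece's whitespace marker

def mark_whitespace(text):
    # Prefix the text and replace spaces so that tokens carry their
    # leading-whitespace information explicitly
    return MARKER + text.replace(" ", MARKER)

print(mark_whitespace("Hello world"))  # ▁Hello▁world
```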
17.2 Code
Challenges:
Long-tail identifiers (DataLoaderConfigOverride, var_x_p10)
Operators (==, ->, ::, ...)
Whitespace and indentation matter (Python)
Mixed natural language (comments) and code syntax
Code-specific tokenizers:
StarCoder, CodeGen, CodeLlama: Train on large code corpora
Preserve whitespace: Use explicit tokens for tabs, newlines
No cross-boundary merges: Don’t merge across /, ., ->
Special tokens: <|endoftext|>, <|file|>, <|repo|>
Language markers: <|python|>, <|java|>, etc.
Why not just BPE?
BPE merges frequent character pairs, but code frequencies are:
Extremely skewed (long tail of rare identifiers)
Not linguistically meaningful (‘‘->’’ should stay atomic)
Require syntax sensitivity (don’t merge inside string literals)
17.3 Speech Transcripts
Characteristics:
No punctuation
Filled pauses (‘‘uh’’, ‘‘erm’’, ‘‘like’’)
Non-standard words, disfluencies, false starts
Proper nouns, rare entities
Design choices:
Byte or character-level: Handles noisy ASR outputs
Add special tokens: <|startoftranscript|>, <|speaker|>, timestamps
Whisper-style: Task tokens (<|transcribe|>, <|translate|>)
ASR normalization: Lowercase, remove disfluencies
17.4 Compression-Heavy Tasks: Summarization, RAG
For summarization:
Summaries compress information → compression interacts with token boundaries
Prefer smaller average token length → better semantic granularity
Stable numeric representation (dates, percentages)
For RAG (Retrieval-Augmented Generation):
Chunking must be tokenizer-aware
Embeddings depend on tokenization quality
Long tokens → uneven chunk boundaries → retrieval drift
Byte-level or Unigram LM yields better recall for embeddings
RAG Chunking Issue:
Document: ‘‘...cardiovascular disease prevalence increased...’’
Bad tokenization:
Chunk 1 ends: [‘‘cardio’’]
Chunk 2 starts: [‘‘vascular’’, ‘‘disease’’, …]
Result: Query ‘‘cardiovascular disease’’ retrieves Chunk 2 but misses context from Chunk 1.
Solution: Use overlap between chunks or tokenizer-aware splitting.
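A minimal token-level chunker with overlap (a sketch; the `tokens` list here stands in for the output of whatever tokenizer the embedding model uses):

```python
def chunk_tokens(tokens, chunk_size, overlap):
    # Slide a window with overlap so that multi-token terms split at a
    # chunk boundary still co-occur inside at least one chunk
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = ["...", "cardio", "vascular", "disease",
          "prevalence", "increased", "..."]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=2)
print(chunks)  # 'cardio' and 'vascular' now share the first chunk
```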
18 Modern Tokenizer Design Principles
18.1 Key Considerations
Vocabulary size: Balance between sequence length and embedding size
Coverage: Handle rare words, typos, emojis, code, multilingual text
Efficiency: Compression ratio (tokens per byte)
Semantic alignment: Tokens should align with meaningful units (morphemes, words)
Stability: Small text changes shouldn’t drastically change tokenization
Deployment: Model size, inference speed, external dependencies
18.2 Interview Cheat Phrases
Tokenizer \(\leftrightarrow\) Model Behavior:
“Tokenizer granularity determines the sequence length distribution, KV cache consumption, training dynamics, and even evaluation metrics like perplexity.”
Byte vs. Subword:
“Bytes give universality and robustness; subwords give compression. Modern architectures try to compensate for byte-level sequence length with efficient attention.”
Unigram vs. BPE:
“BPE greedily merges; Unigram LM evaluates whole-sequence likelihood using Viterbi, which yields more globally optimal segmentations.”
Perplexity:
“Tokenizer entropy dominates PPL. When tokenization changes, perplexity no longer measures the same distribution.”
KV Cache:
“Tokenization determines effective sequence length, and KV cache scales linearly with token count. This directly impacts inference latency and memory.”
19 Advanced Topics
19.1 Adaptive Tokenization
Recent research explores learned tokenization where the model itself determines segmentation.
Soft merging: Learnable downsampling instead of hard BPE merges
Neural tokenization: Train segmentation jointly with language model
Context-dependent: Different segmentation based on surrounding text
19.2 Tokenization-Free Models
Emerging architectures that bypass explicit tokenization:
Perceiver AR: Processes raw bytes with cross-attention to latent array
MEGABYTE: Hierarchical byte-level model (bytes → patches → global)
Character Transformers: Direct character-level prediction with efficient attention
19.3 SentencePiece Implementation
SentencePiece is a widely used tokenizer framework that trains directly on raw text and can implement either BPE or Unigram LM (used in T5/mT5, XLNet, XLM-R, LLaMA, etc.).
Key features:
Language-agnostic (no whitespace assumptions)
Self-contained (no external dependencies)
Supports both BPE and Unigram LM
Deterministic encoding/decoding
Efficient C++ implementation with Python bindings
Why SentencePiece?
Treats text as raw Unicode stream (uses a dedicated whitespace marker, commonly U+2581)
No language-specific preprocessing (Chinese segmentation, etc.)
Same code path for all languages
Used in models from Google, Meta, and many other open-source projects
20 Summary: Decision Framework
20.1 When to Use What
| Use Case | Recommended Tokenizer |
|---|---|
| General-purpose LLM | BPE (GPT), Unigram LM (T5), vocab \(\sim\)32k–100k |
| Multilingual | SentencePiece Unigram LM, vocab \(\sim\)100k–250k |
| Code generation | Code-specific BPE (CodeLlama), preserve operators |
| Ultra-long context | Byte-level + efficient attention (RWKV, Charformer) |
| Edge deployment | Low vocab (\(\sim\)5k), small embeddings |
| High robustness (noisy text) | Byte-level or character-level |
| RAG embeddings | Unigram LM or byte-level (stable chunking) |
| Speech transcripts | Character or byte-level + special tokens |
20.2 Core Tradeoffs
| Factor | High Vocab | Low Vocab |
|---|---|---|
| Sequence length | ↓ Short | ↑ Long |
| Model size | ↑ Large | ↓ Small |
| Attention cost | ↓ Low | ↑ High |
| Generalization | ↓ Brittle | ↑ Robust |
| Compression | ↑ Good | ↓ Poor |