Chapter 7: Tokenization for Large Language Models
9 Tokenization Fundamentals
9.1 What is Tokenization?
Tokenization is the process of converting raw text into a sequence of discrete units (tokens) that can be mapped to indices in a fixed vocabulary.
\[\text{Text: ``Hello world''} \xrightarrow{\text{tokenize}} [\text{``Hello''}, \text{``\_world''}] \xrightarrow{\text{encode}} [15496, 995]\]
Whitespace markers: Many modern subword tokenizers encode leading whitespace explicitly, e.g., SentencePiece uses a dedicated U+2581 whitespace marker and GPT-2 style BPE uses a dedicated leading-space marker (often rendered as a visible “space” character), instead of relying on a separate “split on spaces” pre-tokenizer.
Why tokenization matters:
Defines the granularity of language modeling (character, subword, word)
Determines vocabulary size and thus embedding matrix size
Affects sequence length and attention complexity
Impacts model’s ability to handle rare words, typos, multilingual text
Directly influences perplexity, KV cache size, inference latency
9.2 Tokenization Hierarchy
| Granularity | Vocab Size | Seq Length | Coverage |
|---|---|---|---|
| Character | Small (100–500) | Very long | Perfect |
| Subword (BPE/Unigram) | Medium (10k–100k) | Medium | High |
| Word | Large (100k–1M) | Short | Poor (OOV) |
| Byte | Minimal (256) | Very long | Perfect |
Core Tradeoff: Smaller vocabulary → longer sequences → more attention computation. Larger vocabulary → shorter sequences → bigger embedding matrices.
10 Byte Pair Encoding (BPE)
10.1 Algorithm
BPE is a greedy, bottom-up algorithm that iteratively merges the most frequent adjacent token pairs.
Initialize: Start with character-level vocabulary
Count: Find most frequent adjacent pair \((a, b)\) in corpus
Merge: Replace all occurrences of \((a, b)\) with new token \(ab\)
Repeat: Until vocabulary reaches target size
BPE Merge Example:
Corpus: low low low lower lowest
Iteration 1: Most frequent pair = (l, o) → merge to lo
Result: lo w lo w lo w lo wer lo west
Iteration 2: Most frequent pair = (lo, w) → merge to low
Result: low low low lower lowest
Final vocabulary (after further merge iterations): {l, o, w, e, r, s, t, lo, low, lower, lowest}
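The merge loop can be sketched in a few lines of Python (a toy illustration over the corpus above; real BPE implementations also record the merge list for use at encoding time):

```python
from collections import Counter

def pair_counts(corpus):
    # corpus maps a word (tuple of symbols) to its frequency
    counts = Counter()
    for word, freq in corpus.items():
        for pair in zip(word, word[1:]):
            counts[pair] += freq
    return counts

def apply_merge(corpus, pair):
    # Replace every adjacent occurrence of `pair` with the merged symbol
    a, b = pair
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(a + b)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# "low low low lower lowest", pre-split into characters
corpus = {tuple("low"): 3, tuple("lower"): 1, tuple("lowest"): 1}
merges = []
for _ in range(2):  # two merge iterations, as in the example
    counts = pair_counts(corpus)
    best = max(counts, key=counts.get)
    merges.append(best)
    corpus = apply_merge(corpus, best)

print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After these two merges the corpus reads low / low·e·r / low·e·s·t, matching iteration 2 in the example.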
10.2 Properties
Deterministic: Given corpus statistics, merges are fixed
Greedy: Locally optimal (maximizes frequency at each step)
No probabilistic model: Pure frequency-based heuristic
Compression-focused: Minimizes total token count on training corpus
BPE Limitation: Greedy merging can produce suboptimal segmentations for rare words. Example: ‘‘unprecedented’’ might segment as [‘‘un’’, ‘‘pre’’, ‘‘ced’’, ‘‘ented’’] instead of meaningful morphemes.
10.3 Byte-Level BPE (GPT-2 Style)
Byte-level BPE operates on UTF-8 bytes instead of Unicode characters:
Base vocabulary = 256 (all possible bytes)
No out-of-vocabulary (OOV) issues by construction
Handles any Unicode text, emojis, rare scripts
Used in GPT-2 and OpenAI’s tiktoken-style tokenizers (e.g., GPT-3.5/4 families)
Why Bytes?
Character ‘‘é’’ (U+00E9):
UTF-8 encoding: 0xC3 0xA9 (2 bytes)
Byte-level BPE sees: [195, 169]
Can be merged if frequent: [‘‘é’’]
This ensures universal coverage without maintaining explicit character sets.
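Python’s built-in UTF-8 codec shows exactly what a byte-level tokenizer sees for this character:

```python
# "é" (U+00E9) as the two UTF-8 bytes a byte-level BPE operates on
raw = "é".encode("utf-8")
print(list(raw))            # [195, 169]
print(raw.decode("utf-8"))  # é (round-trips losslessly)
```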
11 WordPiece (BERT-Style)
11.1 What it is
WordPiece is a subword tokenization scheme popularized by BERT. It learns a fixed vocabulary of subword units and uses a deterministic, greedy segmentation at inference time (often “longest-match-first”).
11.2 How it tokenizes
Words are split into subword pieces; continuation pieces are typically marked with a prefix like ##
If a word cannot be fully decomposed into known pieces, it maps to [UNK] (no universal coverage unless augmented with byte fallback)
WordPiece example (BERT convention):
‘‘unaffable’’ \(\rightarrow\) [‘‘un’’, ‘‘##aff’’, ‘‘##able’’]
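Longest-match-first segmentation can be sketched as follows (a minimal illustration over a hypothetical three-piece vocabulary; real WordPiece implementations also cap word length and handle normalization):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first segmentation with ## continuation marks
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation piece
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if match is None:
            return [unk]  # no full decomposition: fall back to [UNK]
        pieces.append(match)
        start = end
    return pieces

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("xyz", vocab))        # ['[UNK]']
```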
11.3 Relation to BPE
Both WordPiece and BPE produce subword vocabularies, but they differ in how merges/pieces are chosen during training (BPE: greedy frequency merges; WordPiece: pieces chosen to improve a likelihood objective) and in conventions like ## continuation markers and [UNK] fallback.
12 Unigram Language Model Tokenization
12.1 Probabilistic Formulation
Unlike BPE’s greedy frequency counting, Unigram LM models tokenization as a probabilistic process.
Given vocabulary \(V\) and text \(X\), find the segmentation \(\mathbf{x} = (x_1, \ldots, x_n)\) that maximizes: \[P(\mathbf{x}) = \prod_{i=1}^n P(x_i)\] where \(P(x_i)\) is the unigram probability of token \(x_i\) in \(V\).
12.2 Training Algorithm
Initialize: Start with large vocabulary (all substrings up to length \(k\))
EM Training:
E-step: For each word, find the maximum likelihood segmentation using Viterbi
M-step: Update token probabilities based on segmentation frequencies
Prune: Remove tokens that minimally affect likelihood (typically remove 10–20%)
Repeat: Until vocabulary reaches target size
Key Difference from BPE:
BPE: Greedy local merges (bottom-up)
Unigram LM: Global likelihood optimization (top-down pruning)
Unigram LM evaluates entire segmentations, not just adjacent pairs
12.3 Viterbi Decoding
To find the best segmentation, use dynamic programming:
\[\text{best}[i] = \max_{j < i} \left(\text{best}[j] + \log P(X[j:i])\right)\]
where \(X[j:i]\) is substring from position \(j\) to \(i\), and \(\text{best}[i]\) stores the maximum log-probability for segmenting \(X[0:i]\).
Algorithm:
Initialize: \(\text{best}[0] = 0\) (empty string has probability 1)
For each position \(i = 1\) to \(n\):
Try all substrings \(X[j:i]\) for \(j < i\) that exist in vocabulary \(V\)
Compute \(\text{best}[j] + \log P(X[j:i])\) for each valid \(j\)
Set \(\text{best}[i] = \max_j (\text{best}[j] + \log P(X[j:i]))\)
Backtrack to recover the segmentation
Complexity: \(O(n^2)\) vocabulary lookups per word (or \(O(n \cdot k)\) if the longest token has length \(k\)), where \(n\) is word length.
Viterbi Example: Segmenting “newer”
Vocabulary with log-probabilities (trained from corpus):
\(\log P(\texttt{``new''}) = -1.0\) (common word, high prob)
\(\log P(\texttt{``er''}) = -2.0\) (common suffix)
\(\log P(\texttt{``newer''}) = -3.5\) (less frequent as single token)
\(\log P(\texttt{``n''}) = -4.0\), \(\log P(\texttt{``e''}) = -4.0\), \(\log P(\texttt{``w''}) = -4.0\), \(\log P(\texttt{``r''}) = -4.0\) (single chars)
Dynamic Programming Table:
| \(i\) | \(\text{best}[i]\) | Best segmentation to position \(i\) |
|---|---|---|
| 0 | 0.0 | [] |
| 1 | -4.0 | [‘‘n’’] |
| 2 | -8.0 | [‘‘n’’, ‘‘e’’] |
| 3 | -1.0 | [‘‘new’’] \(\leftarrow\) better than [‘‘n’’, ‘‘e’’, ‘‘w’’] = -12.0 |
| 4 | -5.0 | [‘‘new’’, ‘‘e’’] |
| 5 | -3.0 | [‘‘new’’, ‘‘er’’] |
Key computation at \(i=3\):
Option 1: \(\text{best}[2] + \log P(\texttt{``w''}) = -8.0 + (-4.0) = -12.0\)
Option 2: \(\text{best}[0] + \log P(\texttt{``new''}) = 0.0 + (-1.0) = -1.0\) \(\leftarrow\) winner!
Key computation at \(i=5\) (end of word):
Option 1: \(\text{best}[3] + \log P(\texttt{``er''}) = -1.0 + (-2.0) = -3.0\) \(\leftarrow\) winner!
Option 2: \(\text{best}[0] + \log P(\texttt{``newer''}) = 0.0 + (-3.5) = -3.5\) (worse)
Final segmentation: [‘‘new’’, ‘‘er’’] with total log-prob = \(-3.0\)
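The worked example can be reproduced with a short Viterbi implementation (a toy sketch; production Unigram tokenizers add character fallback for out-of-vocabulary symbols):

```python
import math

def viterbi_segment(text, logp):
    # best[i]: max log-prob of segmenting text[:i]; back[i]: split point
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
                back[i] = j
    # Backtrack from the end to recover the winning segmentation
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1], best[n]

logp = {"new": -1.0, "er": -2.0, "newer": -3.5,
        "n": -4.0, "e": -4.0, "w": -4.0, "r": -4.0}
tokens, score = viterbi_segment("newer", logp)
print(tokens, score)  # ['new', 'er'] -3.0
```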
Why this is better than BPE: BPE would greedily merge based on frequency counts without considering the entire word’s likelihood. If ‘‘ne’’ and ‘‘ew’’ happened to be frequent pairs in training, BPE might produce [‘‘ne’’, ‘‘wer’’] or [‘‘new’’, ‘‘er’’] depending on merge order. Unigram LM evaluates all possibilities and picks the globally optimal segmentation under the probabilistic model.
13 High Vocabulary vs. Low Vocabulary
13.1 Tradeoff Table
| Dimension | High Vocab (50k–300k) | Low Vocab (200–5k) |
|---|---|---|
| Sequence length | Short (fewer tokens) | Long (4–5× more tokens) |
| Embedding size | Large (30–40% of params) | Very small |
| Training compute | Less per sequence (shorter attention) | More per sequence (longer attention) |
| Generalization | Struggles with typos, rare words | Extremely robust |
| Multilingual | Complex rules, uneven coverage | Uniform across languages |
| Code modeling | Poor without special tokens | Very robust to identifiers |
| Compression | Good | Poor |
| Hardware deployment | Heavy weights (GB scale) | Lightweight (MB scale) |
13.2 Mathematical Perspective
Total compute cost: \[\text{Cost} = \underbrace{n^2 d}_{\text{attention}} + \underbrace{n d V}_{\text{output projection}} + \underbrace{V d}_{\text{embedding params}}\]
where \(n\) = sequence length, \(d\) = hidden dimension, \(V\) = vocabulary size.
What each term represents:
\(n^2 d\): Attention mechanism – computing \(QK^T\) (shape \(n \times n\)) and applying to \(V\), plus Q/K/V projections
\(n d V\): Output projection – mapping final hidden states (\(n \times d\)) to vocabulary logits (\(n \times V\)) via weight matrix (\(d \times V\))
\(V d\): Embedding matrix storage – parameter count for input/output embeddings (often tied)
Concrete example: for an illustrative GPT-3-width model (50k vocab, 12k context, 12k hidden dim):
Attention: \(n^2 d = (12k)^2 \times 12k = 1.7 \times 10^{12}\) FLOPs per layer
Output: \(n d V = 12k \times 12k \times 50k = 7.2 \times 10^{12}\) FLOPs
Embeddings: \(V d = 50k \times 12k = 600\)M parameters
At this scale the output projection costs more than any single layer’s attention, although attention summed over all layers still dominates total compute.
High vocab: Small \(n\), large \(V\) → output projection/embeddings dominate
Low vocab: Large \(n\), small \(V\) → attention dominates
Sweet spot depends on hardware (memory vs compute bound)
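A quick sanity check of the three terms for the numbers above:

```python
n, d, V = 12_000, 12_000, 50_000   # context length, hidden dim, vocab size

attention_per_layer = n * n * d    # QK^T plus applying attention, per layer
output_projection = n * d * V      # final hidden states to vocabulary logits
embedding_params = V * d           # embedding matrix parameter count

print(f"{attention_per_layer:.1e}")  # 1.7e+12
print(f"{output_projection:.1e}")    # 7.2e+12
print(f"{embedding_params:.1e}")     # 6.0e+08
```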
Which is better?
Compute-bound (data center GPUs): High vocab → shorter sequences
Memory-bound (edge devices, mobile): Low vocab → smaller models
Multilingual/code-heavy: Low vocab → better generalization
14 Character and Byte-Level Models
14.1 Why Character/Byte-Level?
Several modern architectures (Charformer, RWKV, ByT5, MEGABYTE) explore character- or byte-level tokenization.
Advantages:
Eliminates OOV issues: Can represent any text by construction
Smaller embedding matrix: Vocab = 256 bytes vs. 50k subwords
Robust to noise: Typos, OCR errors, social media text
Consistent multilingual behavior: Uniform granularity across languages
Enables ultra-long contexts: Combines with efficient attention (RWKV, Hyena)
Simpler deployment: No external tokenizer runtime
14.2 The Sequence Length Problem
Challenge: Byte-level encoding increases sequence length by roughly 4–5×.
For text: ‘‘The quick brown fox’’
Subword (BPE): 5 tokens
Byte-level: 19 bytes
Attention complexity: \(O(n^2 d)\) → roughly \((19/5)^2 \approx 14\times\) more compute!
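The blowup for the example string, assuming the 5-token BPE segmentation quoted above:

```python
text = "The quick brown fox"
n_bytes = len(text.encode("utf-8"))  # every ASCII character is one byte
n_subword = 5                        # illustrative BPE token count from the text

print(n_bytes)                               # 19
print(round((n_bytes / n_subword) ** 2, 1))  # ~14.4x quadratic attention cost
```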
14.3 Solutions: Efficient Attention
Local convolutions (Charformer): Downsample bytes → subword-like chunks internally
Linear attention (RWKV): Replace softmax attention with recurrent formulation
Hyena/Monarch: State-space models with sub-quadratic complexity
Delta attention (Kimi): Gated recurrent updates for long-context
Charformer Architecture:
\[\text{Bytes} \xrightarrow{\text{Conv layers}} \text{Subword-like features} \xrightarrow{\text{Transformer}} \text{Output}\]
Input: 256-dim byte embeddings
Local CNN: Group adjacent bytes (e.g., 4-byte windows)
Output: Compressed sequence (\(\sim\)subword length)
Then apply standard Transformer
Result: Byte-level robustness with subword-level efficiency!
Modern Trend: Bytes + efficient attention → best of both worlds.
“Bytes give universality and robustness; subwords give compression. Modern architectures try to compensate for byte-level sequence length with efficient attention.”
15 Perplexity and Tokenization
15.1 The Perplexity Trap
\[\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^N \log P(t_i)\right)\] where \(N\) is the number of tokens, \(t_i\) are predicted tokens.
Critical issue: Perplexity is not comparable across different tokenizers.
Example: Comparing Two Models
Text: ‘‘The cat sat on the mat’’
Model A (BPE, 50k vocab):
Tokens: [‘‘The’’, ‘‘_cat’’, ‘‘_sat’’, ‘‘_on’’, ‘‘_the’’, ‘‘_mat’’], \(N = 6\) tokens
Perplexity = 12.3
Model B (byte-level, 256 vocab):
Tokens: [T, h, e, _, c, a, t, …], \(N = 22\) bytes
Perplexity = 45.7
Model B appears worse, but it’s predicting at a finer granularity!
15.2 Fair Comparison Metrics
Tokenization-invariant metrics:
Bits per character (BPC): \[\text{BPC} = -\frac{1}{C} \sum_{i=1}^N \log_2 P(t_i)\] where \(C\) = number of characters, \(N\) = number of tokens, and we sum over tokens \(t_i\).
Bits per byte (BPB): Same as BPC but using bytes.
Token-normalized cross-entropy: Use same tokenizer for both models.
Direct task evaluation: MMLU, HELM, HumanEval, RAG benchmarks.
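Converting both models’ perplexities to bits per character puts them on a common scale (using the illustrative numbers from the example above):

```python
import math

def bits_per_char(ppl, n_tokens, n_chars):
    # Total bits for the text = n_tokens * log2(ppl); normalize by characters
    return n_tokens * math.log2(ppl) / n_chars

chars = len("The cat sat on the mat")           # 22 characters
print(round(bits_per_char(12.3, 6, chars), 2))  # Model A (BPE): 0.99 BPC
print(round(bits_per_char(45.7, 22, chars), 2)) # Model B (byte): 5.51 BPC
```

Once normalized this way, the two numbers measure the same quantity: total bits assigned to the same character stream.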
Interview Soundbite:
“Perplexity is only meaningful when tokenization is identical. Tokenizer entropy dominates PPL. When tokenization changes, perplexity no longer measures the same distribution.”
16 Tokenization and KV Cache
16.1 KV Cache Basics
During autoregressive decoding, Transformers cache key-value pairs for all previous tokens:
\[\text{Memory}_{\text{KV}} = 2 \cdot L \cdot d \cdot n \cdot b\]
where:
\(L\) = number of layers
\(d\) = hidden dimension
\(n\) = sequence length (number of tokens)
\(b\) = bytes per parameter (2 for FP16, 4 for FP32)
Factor of 2: keys + values
16.2 Tokenization Impact
Key insight: Tokenization determines \(n\) (sequence length), which linearly affects KV cache size.
KV Cache Calculation:
Model: 40 layers, 8192 hidden dim, FP16 (2 bytes)
Text: 10,000 characters
BPE tokenizer (50k vocab):
Sequence length: 2,500 tokens
KV memory: \(2 \cdot 40 \cdot 8192 \cdot 2500 \cdot 2 \approx 3.3\) GB
Byte-level tokenizer (256 vocab):
Sequence length: 10,000 bytes (ASCII text: 1 byte per character)
KV memory: \(2 \cdot 40 \cdot 8192 \cdot 10000 \cdot 2 \approx 13.1\) GB
Byte-level uses 4× more memory for KV cache due to longer sequences!
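The same calculation in code (decimal gigabytes):

```python
def kv_cache_bytes(n_layers, d_model, n_tokens, bytes_per_value=2):
    # Factor of 2 for keys + values; bytes_per_value=2 for FP16
    return 2 * n_layers * d_model * n_tokens * bytes_per_value

bpe = kv_cache_bytes(40, 8192, 2_500)
byte_level = kv_cache_bytes(40, 8192, 10_000)

print(bpe / 1e9)         # 3.2768  (GB)
print(byte_level / 1e9)  # 13.1072 (GB)
print(byte_level / bpe)  # 4.0
```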
16.3 Consequences for Inference
Latency: More tokens → more decoding steps → higher latency
Memory pressure: Hits GPU memory limits faster
Batch size: Reduces max batch size for batched inference
Sliding window: Hits context window limit sooner
Speculative decoding: Affects branching factor and draft model granularity
Interview Soundbite:
“Tokenization determines effective sequence length, and KV cache scales linearly with token count. Low-vocab tokenizers create longer sequences, increasing cache pressure, latency, GPU memory use, and impacting long-context attention strategies.”
17 Domain-Specific Tokenization
17.1 Multilingual Data
Challenges:
Multiple scripts (Latin, CJK, Arabic, Indic, Cyrillic)
Whitespace-agnostic languages (Thai, Lao, Khmer)
Rare Unicode characters, emojis
Best practices:
SentencePiece Unigram LM: Language-agnostic, no whitespace assumptions
Byte-level BPE: Universal coverage
Shared vocabulary: Train on multilingual corpus (mBERT, XLM-R, mT5)
Explicit space marker: SentencePiece commonly uses U+2581 (LOWER ONE EIGHTH BLOCK) as a leading whitespace marker
Larger vocabulary: 100k–250k to balance compression across languages
Multilingual Segmentation:
English: ‘‘Hello world’’ \(\rightarrow\) [‘‘Hello’’, ‘‘_world’’]
Chinese: ‘‘你好世界’’ (nǐhǎo shìjiè) \(\rightarrow\) [‘‘你’’, ‘‘好’’, ‘‘世’’, ‘‘界’’]
Thai (no spaces): ‘‘สวัสดีโลก’’ (sawasdee lok) \(\rightarrow\) [‘‘สวัสดี’’, ‘‘โลก’’]
SentencePiece handles all three uniformly without language-specific rules.
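The whitespace marker convention can be illustrated without the library itself (a sketch; real SentencePiece applies this, by default, as part of its internal normalization):

```python
MARKER = "\u2581"  # ▁ LOWER ONE EIGHTH BLOCK, SentencePiece's whitespace marker

def mark_whitespace(text):
    # Prefix the text and replace spaces so that tokens carry their
    # leading-whitespace information explicitly
    return MARKER + text.replace(" ", MARKER)

print(mark_whitespace("Hello world"))  # ▁Hello▁world
```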
17.2 Code
Challenges:
Long-tail identifiers (DataLoaderConfigOverride, var_x_p10)
Operators (==, ->, ::, ...)
Whitespace and indentation matter (Python)
Mixed natural language (comments) and code syntax
Code-specific tokenizers:
StarCoder, CodeGen, CodeLlama: Train on large code corpora
Preserve whitespace: Use explicit tokens for tabs, newlines
No cross-boundary merges: Don’t merge across /, ., ->
Special tokens: <|endoftext|>, <|file|>, <|repo|>
Language markers: <|python|>, <|java|>, etc.
Why not just BPE?
BPE merges frequent character pairs, but code frequencies are:
Extremely skewed (long tail of rare identifiers)
Not linguistically meaningful (‘‘->’’ should stay atomic)
Require syntax sensitivity (don’t merge inside string literals)
17.3 Speech Transcripts
Characteristics:
No punctuation
Filled pauses (‘‘uh’’, ‘‘erm’’, ‘‘like’’)
Non-standard words, disfluencies, false starts
Proper nouns, rare entities
Design choices:
Byte or character-level: Handles noisy ASR outputs
Add special tokens: <|startoftranscript|>, <|speaker|>, timestamps
Whisper-style: Task tokens (<|transcribe|>, <|translate|>)
ASR normalization: Lowercase, remove disfluencies
17.4 Compression-Heavy Tasks: Summarization, RAG
For summarization:
Summaries compress information → compression interacts with token boundaries
Prefer smaller average token length → better semantic granularity
Stable numeric representation (dates, percentages)
For RAG (Retrieval-Augmented Generation):
Chunking must be tokenizer-aware
Embeddings depend on tokenization quality
Long tokens → uneven chunk boundaries → retrieval drift
Byte-level or Unigram LM yields better recall for embeddings
RAG Chunking Issue:
Document: ‘‘...cardiovascular disease prevalence increased...’’
Bad tokenization:
Chunk 1 ends: [‘‘cardio’’]
Chunk 2 starts: [‘‘vascular’’, ‘‘disease’’, …]
Result: Query ‘‘cardiovascular disease’’ retrieves Chunk 2 but misses context from Chunk 1.
Solution: Use overlap between chunks or tokenizer-aware splitting.
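A minimal token-level chunker with overlap (a sketch; the `tokens` list here stands in for the output of whatever tokenizer the embedding model uses):

```python
def chunk_tokens(tokens, chunk_size, overlap):
    # Slide a window with overlap so that multi-token terms split at a
    # chunk boundary still co-occur inside at least one chunk
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = ["...", "cardio", "vascular", "disease",
          "prevalence", "increased", "..."]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=2)
print(chunks)  # 'cardio' and 'vascular' now share the first chunk
```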
18 Modern Tokenizer Design Principles
18.1 Key Considerations
Vocabulary size: Balance between sequence length and embedding size
Coverage: Handle rare words, typos, emojis, code, multilingual text
Efficiency: Compression ratio (tokens per byte)
Semantic alignment: Tokens should align with meaningful units (morphemes, words)
Stability: Small text changes shouldn’t drastically change tokenization
Deployment: Model size, inference speed, external dependencies
18.2 Interview Cheat Phrases
Tokenizer \(\leftrightarrow\) Model Behavior:
“Tokenizer granularity determines the sequence length distribution, KV cache consumption, training dynamics, and even evaluation metrics like perplexity.”
Byte vs. Subword:
“Bytes give universality and robustness; subwords give compression. Modern architectures try to compensate for byte-level sequence length with efficient attention.”
Unigram vs. BPE:
“BPE greedily merges; Unigram LM evaluates whole-sequence likelihood using Viterbi, which yields more globally optimal segmentations.”
Perplexity:
“Tokenizer entropy dominates PPL. When tokenization changes, perplexity no longer measures the same distribution.”
KV Cache:
“Tokenization determines effective sequence length, and KV cache scales linearly with token count. This directly impacts inference latency and memory.”
19 Advanced Topics
19.1 Adaptive Tokenization
Recent research explores learned tokenization where the model itself determines segmentation.
Soft merging: Learnable downsampling instead of hard BPE merges
Neural tokenization: Train segmentation jointly with language model
Context-dependent: Different segmentation based on surrounding text
19.2 Tokenization-Free Models
Emerging architectures that bypass explicit tokenization:
Perceiver AR: Processes raw bytes with cross-attention to latent array
MEGABYTE: Hierarchical byte-level model (bytes → patches → global)
Character Transformers: Direct character-level prediction with efficient attention
19.3 SentencePiece Implementation
SentencePiece is a widely used tokenizer framework that trains directly on raw text and can implement either BPE or Unigram LM (used in T5/mT5, XLNet, XLM-R, LLaMA, etc.).
Key features:
Language-agnostic (no whitespace assumptions)
Self-contained (no external dependencies)
Supports both BPE and Unigram LM
Deterministic encoding/decoding
Efficient C++ implementation with Python bindings
Why SentencePiece?
Treats text as raw Unicode stream (uses a dedicated whitespace marker, commonly U+2581)
No language-specific preprocessing (Chinese segmentation, etc.)
Same code path for all languages
Used in models from Google, Meta, and many other open-source projects
20 Summary: Decision Framework
20.1 When to Use What
| Use Case | Recommended Tokenizer |
|---|---|
| General-purpose LLM | BPE (GPT), Unigram LM (T5), vocab \(\sim\)32k–100k |
| Multilingual | SentencePiece Unigram LM, vocab \(\sim\)100k–250k |
| Code generation | Code-specific BPE (CodeLlama), preserve operators |
| Ultra-long context | Byte-level + efficient attention (RWKV, Charformer) |
| Edge deployment | Low vocab (\(\sim\)5k), small embeddings |
| High robustness (noisy text) | Byte-level or character-level |
| RAG embeddings | Unigram LM or byte-level (stable chunking) |
| Speech transcripts | Character or byte-level + special tokens |
20.2 Core Tradeoffs
| Factor | High Vocab | Low Vocab |
|---|---|---|
| Sequence length | ↓ Short | ↑ Long |
| Model size | ↑ Large | ↓ Small |
| Attention cost | ↓ Low | ↑ High |
| Generalization | ↓ Brittle | ↑ Robust |
| Compression | ↑ Good | ↓ Poor |