8  Chapter 7: Tokenization for Large Language Models

9 Tokenization Fundamentals

9.1 What is Tokenization?

Tokenization is the process of converting raw text into a sequence of discrete units (tokens) that can be mapped to indices in a fixed vocabulary.

\[\text{Text: ``Hello world''} \xrightarrow{\text{tokenize}} [\text{``Hello''}, \text{``\_world''}] \xrightarrow{\text{encode}} [15496, 995]\]

Note

Whitespace markers: Many modern subword tokenizers encode leading whitespace explicitly: SentencePiece prepends a dedicated marker character (U+2581, ‘‘▁’’), and GPT-2 style byte-level BPE maps the leading space byte to a visible symbol (rendered as ‘‘Ġ’’), instead of relying on a separate ‘‘split on spaces’’ pre-tokenizer.

Why tokenization matters:

  • Defines the granularity of language modeling (character, subword, word)

  • Determines vocabulary size and thus embedding matrix size

  • Affects sequence length and attention complexity

  • Impacts model’s ability to handle rare words, typos, multilingual text

  • Directly influences perplexity, KV cache size, inference latency

9.2 Tokenization Hierarchy

Granularity            Vocab Size         Seq Length   Coverage
Character              Small (100–500)    Very long    Perfect
Subword (BPE/Unigram)  Medium (10k–100k)  Medium       High
Word                   Large (100k–1M)    Short        Poor (OOV)
Byte                   Minimal (256)      Very long    Perfect
Note

Core Tradeoff: Smaller vocabulary → longer sequences → more attention computation. Larger vocabulary → shorter sequences → bigger embedding matrices.

10 Byte Pair Encoding (BPE)

10.1 Algorithm

BPE is a greedy, bottom-up algorithm that iteratively merges the most frequent adjacent token pairs.

  1. Initialize: Start with character-level vocabulary

  2. Count: Find most frequent adjacent pair \((a, b)\) in corpus

  3. Merge: Replace all occurrences of \((a, b)\) with new token \(ab\)

  4. Repeat: Until vocabulary reaches target size

Example

BPE Merge Example:

Corpus: low low low lower lowest

Iteration 1: Most frequent pair = (l, o) → merge to lo

Result: lo w lo w lo w lo wer lo west

Iteration 2: Most frequent pair = (lo, w) → merge to low

Result: low low low lower lowest

Final vocabulary (after further merges): {l, o, w, e, r, s, t, lo, low, lowe, lower, lowes, lowest} — intermediate tokens such as ‘‘lowe’’ and ‘‘lowes’’ enter the vocabulary on the way to ‘‘lower’’ and ‘‘lowest’’, since BPE can only merge one adjacent pair at a time.
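The merge loop above can be sketched in a few lines of Python. This is a minimal illustration without pre-tokenization or explicit tie-breaking rules (ties fall back to first-seen order); real implementations track merge ranks and operate on word frequency tables.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges from a list of words (minimal sketch)."""
    words = Counter(tuple(w) for w in corpus)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append((a, b))
        # Replace every occurrence of the pair with the merged token
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = train_bpe(["low", "low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

Running the first two iterations on the toy corpus reproduces the merges from the example above.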

10.2 Properties

  • Deterministic: Given corpus statistics, merges are fixed

  • Greedy: Locally optimal (maximizes frequency at each step)

  • No probabilistic model: Pure frequency-based heuristic

  • Compression-focused: Minimizes total token count on training corpus

Note

BPE Limitation: Greedy merging can produce suboptimal segmentations for rare words. Example: ‘‘unprecedented’’ might segment as [‘‘un’’, ‘‘pre’’, ‘‘ced’’, ‘‘ented’’] instead of meaningful morphemes.

10.3 Byte-Level BPE (GPT-2 Style)

Byte-level BPE operates on UTF-8 bytes instead of Unicode characters:

  • Base vocabulary = 256 (all possible bytes)

  • No out-of-vocabulary (OOV) issues by construction

  • Handles any Unicode text, emojis, rare scripts

  • Used in GPT-2 and OpenAI’s tiktoken-style tokenizers (e.g., GPT-3.5/4 families)

Example

Why Bytes?

Character ‘‘é’’ (U+00E9):

  • UTF-8 encoding: 0xC3 0xA9 (2 bytes)

  • Byte-level BPE sees: [195, 169]

  • Can be merged if frequent: [‘‘é’’]

This ensures universal coverage without maintaining explicit character sets.
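The byte sequence above is easy to verify in Python:

```python
# ‘‘é’’ (U+00E9) is one Unicode character but two UTF-8 bytes
data = "é".encode("utf-8")
print(list(data))  # [195, 169] — i.e., 0xC3 0xA9
```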

11 WordPiece (BERT-Style)

11.1 What it is

WordPiece is a subword tokenization scheme popularized by BERT. It learns a fixed vocabulary of subword units and uses a deterministic, greedy segmentation at inference time (often “longest-match-first”).

11.2 How it tokenizes

  • Words are split into subword pieces; continuation pieces are typically marked with a prefix like ##

  • If a word cannot be fully decomposed into known pieces, it maps to [UNK] (no universal coverage unless augmented with byte fallback)

Example

WordPiece example (BERT convention):

‘‘unaffable’’ \(\rightarrow\) [‘‘un’’, ‘‘##aff’’, ‘‘##able’’]
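The greedy longest-match-first segmentation can be sketched as follows (vocabulary and ## convention as in the example above; real WordPiece implementations add details such as a maximum input length per word):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation (a sketch)."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # word cannot be covered -> [UNK]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```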

11.3 Relation to BPE

Both WordPiece and BPE produce subword vocabularies, but they differ in how merges/pieces are chosen during training (BPE: greedy frequency merges; WordPiece: pieces chosen to improve a likelihood objective) and in conventions like ## continuation markers and [UNK] fallback.

12 Unigram Language Model Tokenization

12.1 Probabilistic Formulation

Unlike BPE’s greedy frequency counting, Unigram LM models tokenization as a probabilistic process.

Given vocabulary \(V\) and text \(X\), find the segmentation \(\mathbf{x} = (x_1, \ldots, x_n)\) that maximizes: \[P(\mathbf{x}) = \prod_{i=1}^n P(x_i)\] where \(P(x_i)\) is the unigram probability of token \(x_i\) in \(V\).

12.2 Training Algorithm

  1. Initialize: Start with large vocabulary (all substrings up to length \(k\))

  2. EM Training:

    • E-step: For each word, find the maximum likelihood segmentation using Viterbi

    • M-step: Update token probabilities based on segmentation frequencies

  3. Prune: Remove tokens that minimally affect likelihood (typically remove 10–20%)

  4. Repeat: Until vocabulary reaches target size

Note

Key Difference from BPE:

  • BPE: Greedy local merges (bottom-up)

  • Unigram LM: Global likelihood optimization (top-down pruning)

  • Unigram LM evaluates entire segmentations, not just adjacent pairs

12.3 Viterbi Decoding

To find the best segmentation, use dynamic programming:

\[\text{best}[i] = \max_{j < i} \left(\text{best}[j] + \log P(X[j:i])\right)\]

where \(X[j:i]\) is substring from position \(j\) to \(i\), and \(\text{best}[i]\) stores the maximum log-probability for segmenting \(X[0:i]\).

Algorithm:

  1. Initialize: \(\text{best}[0] = 0\) (empty string has probability 1)

  2. For each position \(i = 1\) to \(n\):

    • Try all substrings \(X[j:i]\) for \(j < i\) that exist in vocabulary \(V\)

    • Compute \(\text{best}[j] + \log P(X[j:i])\) for each valid \(j\)

    • Set \(\text{best}[i] = \max_j (\text{best}[j] + \log P(X[j:i]))\)

  3. Backtrack to recover the segmentation

Complexity: \(O(n^2)\) vocabulary lookups per word with naive substring enumeration (capping token length at \(k\) reduces this to \(O(nk)\)), where \(n\) is word length.

Example

Viterbi Example: Segmenting “newer”

Vocabulary with log-probabilities (trained from corpus):

  • \(\log P(\texttt{``new''}) = -1.0\) (common word, high prob)

  • \(\log P(\texttt{``er''}) = -2.0\) (common suffix)

  • \(\log P(\texttt{``newer''}) = -3.5\) (less frequent as single token)

  • \(\log P(\texttt{``n''}) = -4.0\), \(\log P(\texttt{``e''}) = -4.0\), \(\log P(\texttt{``w''}) = -4.0\), \(\log P(\texttt{``r''}) = -4.0\) (single chars)

Dynamic Programming Table (‘‘newer’’ has 5 characters, so \(i\) runs from 0 to 5):

\(i\)  \(\text{best}[i]\)  Best segmentation to position \(i\)
0   0.0    []
1   -4.0   [‘‘n’’]
2   -8.0   [‘‘n’’, ‘‘e’’]
3   -1.0   [‘‘new’’] \(\leftarrow\) better than [‘‘n’’, ‘‘e’’, ‘‘w’’] = -12.0
4   -5.0   [‘‘new’’, ‘‘e’’]
5   -3.0   [‘‘new’’, ‘‘er’’]

Key computation at \(i=3\):

  • Option 1: \(\text{best}[2] + \log P(\texttt{``w''}) = -8.0 + (-4.0) = -12.0\)

  • Option 2: \(\text{best}[0] + \log P(\texttt{``new''}) = 0.0 + (-1.0) = -1.0\) \(\leftarrow\) winner!

Key computation at \(i=5\):

  • Option 1: \(\text{best}[3] + \log P(\texttt{``er''}) = -1.0 + (-2.0) = -3.0\) \(\leftarrow\) winner!

  • Option 2: \(\text{best}[4] + \log P(\texttt{``r''}) = -5.0 + (-4.0) = -9.0\) (worse)

  • Option 3: \(\text{best}[0] + \log P(\texttt{``newer''}) = 0.0 + (-3.5) = -3.5\) (worse)

Final segmentation: [‘‘new’’, ‘‘er’’] with total log-prob = \(-3.0\)

Why this is better than BPE: BPE would greedily merge based on frequency counts without considering the entire word’s likelihood. If ‘‘ne’’ and ‘‘ew’’ happened to be frequent pairs in training, BPE might produce [‘‘ne’’, ‘‘wer’’] or [‘‘new’’, ‘‘er’’] depending on merge order. Unigram LM evaluates all possibilities and picks the globally optimal segmentation under the probabilistic model.
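The worked example can be reproduced with a short Viterbi implementation (a sketch; production tokenizers use tries and cap the maximum token length instead of trying every substring):

```python
import math

def viterbi_segment(word, log_p):
    """Max-likelihood segmentation of `word` under a unigram model.

    log_p maps vocabulary tokens to log-probabilities.
    Returns (tokens, total_log_prob).
    """
    n = len(word)
    best = [0.0] + [-math.inf] * n  # best[i] = best log-prob for word[:i]
    back = [0] * (n + 1)            # back[i] = start of the last token in word[:i]
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in log_p and best[j] + log_p[piece] > best[i]:
                best[i] = best[j] + log_p[piece]
                back[i] = j
    tokens, i = [], n               # backtrack to recover the segmentation
    while i > 0:
        tokens.append(word[back[i]:i])
        i = back[i]
    return tokens[::-1], best[n]

log_p = {"new": -1.0, "er": -2.0, "newer": -3.5,
         "n": -4.0, "e": -4.0, "w": -4.0, "r": -4.0}
print(viterbi_segment("newer", log_p))  # (['new', 'er'], -3.0)
```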

13 High Vocabulary vs. Low Vocabulary

13.1 Tradeoff Table

Dimension            High Vocab (50k–300k)                   Low Vocab (200–5k)
Sequence length      Short (fewer tokens)                    Long (4–5× more tokens)
Embedding size       Large (30–40% of params)                Very small
Training compute     Less per sequence (shorter attention)   More per sequence (longer attention)
Generalization       Struggles with typos, rare words        Extremely robust
Multilingual         Complex rules, uneven coverage          Uniform across languages
Code modeling        Poor without special tokens             Very robust to identifiers
Compression          Good                                    Poor
Hardware deployment  Heavy weights (GB scale)                Lightweight (MB scale)

13.2 Mathematical Perspective

Total compute cost: \[\text{Cost} = \underbrace{n^2 d}_{\text{attention}} + \underbrace{n d V}_{\text{output projection}} + \underbrace{V d}_{\text{embedding params}}\]

where \(n\) = sequence length, \(d\) = hidden dimension, \(V\) = vocabulary size.

Note

What each term represents:

  • \(n^2 d\): Attention mechanism – computing \(QK^T\) (shape \(n \times n\)) and applying to \(V\), plus Q/K/V projections

  • \(n d V\): Output projection – mapping final hidden states (\(n \times d\)) to vocabulary logits (\(n \times V\)) via weight matrix (\(d \times V\))

  • \(V d\): Embedding matrix storage – parameter count for input/output embeddings (often tied)

Concrete example: For a GPT-3-scale model (50k vocab, 12k hidden dim) processing a 12k-token sequence:

  • Attention: \(n^2 d = (12k)^2 \times 12k = 1.7 \times 10^{12}\) FLOPs per layer

  • Output: \(n d V = 12k \times 12k \times 50k = 7.2 \times 10^{12}\) FLOPs

  • Embeddings: \(V d = 50k \times 12k = 600\)M parameters

At this scale the single output projection costs as much as roughly four layers of attention; summed over dozens of layers, attention dominates total compute, but the vocabulary-sized output term remains a substantial fixed cost.

  • High vocab: Small \(n\), large \(V\) → output projection/embeddings dominate

  • Low vocab: Large \(n\), small \(V\) → attention dominates

  • Sweet spot depends on hardware (memory vs compute bound)
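A quick way to sanity-check the three terms, using the numbers from the concrete example above:

```python
def cost_terms(n, d, V):
    """The three terms of the cost formula above (FLOP/parameter counts)."""
    return {
        "attention_per_layer": n ** 2 * d,  # QK^T and AV scale as n^2 * d
        "output_projection": n * d * V,     # hidden states -> vocab logits
        "embedding_params": V * d,          # embedding matrix size
    }

costs = cost_terms(n=12_000, d=12_000, V=50_000)
print(costs["attention_per_layer"])  # 1.728e12 FLOPs per layer
print(costs["output_projection"])    # 7.2e12 FLOPs
print(costs["embedding_params"])     # 6e8 parameters (600M)
```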

Note

Which is better?

  • Compute-bound (data center GPUs): High vocab → shorter sequences

  • Memory-bound (edge devices, mobile): Low vocab → smaller models

  • Multilingual/code-heavy: Low vocab → better generalization

14 Character and Byte-Level Models

14.1 Why Character/Byte-Level?

A line of recent architectures (e.g., ByT5, Charformer, MEGABYTE) moves toward character- or byte-level tokenization.

Advantages:

  1. Eliminates OOV issues: Can represent any text by construction

  2. Smaller embedding matrix: Vocab = 256 bytes vs. 50k subwords

  3. Robust to noise: Typos, OCR errors, social media text

  4. Consistent multilingual behavior: Uniform granularity across languages

  5. Enables ultra-long contexts: Combines with efficient attention (RWKV, Hyena)

  6. Simpler deployment: No external tokenizer runtime

14.2 The Sequence Length Problem

Challenge: Byte-level encoding typically increases sequence length by 4–5× relative to subword tokenization.

For text: ‘‘The quick brown fox’’

  • Subword (BPE): 5 tokens

  • Byte-level: 19 bytes

Attention complexity: \(O(n^2 d)\) → \((19/5)^2 \approx 14\times\) more compute for this sentence (roughly 16× for a typical 4× length blow-up)!
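The counts are easy to check; for this ASCII sentence, characters and UTF-8 bytes coincide:

```python
subword_tokens = 5                       # BPE token count from the example
text = "The quick brown fox"
byte_tokens = len(text.encode("utf-8"))  # byte-level sequence length
print(byte_tokens)                       # 19
ratio = (byte_tokens / subword_tokens) ** 2
print(ratio)                             # squared length ratio, ≈14×
```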

14.3 Solutions: Efficient Attention

  1. Local convolutions (Charformer): Downsample bytes → subword-like chunks internally

  2. Linear attention (RWKV): Replace softmax attention with recurrent formulation

  3. Hyena/Monarch: State-space models with sub-quadratic complexity

  4. Delta attention (Kimi): Gated recurrent updates for long-context

Example

Charformer Architecture:

\[\text{Bytes} \xrightarrow{\text{Conv layers}} \text{Subword-like features} \xrightarrow{\text{Transformer}} \text{Output}\]

  • Input: byte embeddings looked up from a 256-entry table (one row per byte value)

  • Local CNN: Group adjacent bytes (e.g., 4-byte windows)

  • Output: Compressed sequence (\(\sim\)subword length)

  • Then apply standard Transformer

Result: Byte-level robustness with subword-level efficiency!

Note

Modern Trend: Bytes + efficient attention → best of both worlds.

“Bytes give universality and robustness; subwords give compression. Modern architectures try to compensate for byte-level sequence length with efficient attention.”

15 Perplexity and Tokenization

15.1 The Perplexity Trap

\[\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^N \log P(t_i \mid t_{<i})\right)\] where \(N\) is the number of tokens and \(P(t_i \mid t_{<i})\) is the model’s probability of token \(t_i\) given the preceding tokens.

Critical issue: Perplexity is not comparable across different tokenizers.

Example

Example: Comparing Two Models

Text: ‘‘The cat sat on the mat’’

Model A (BPE, 50k vocab):

  • Tokens: [‘‘The’’, ‘‘_cat’’, ‘‘_sat’’, ‘‘_on’’, ‘‘_the’’, ‘‘_mat’’]

  • \(N = 6\) tokens

  • Perplexity = 12.3

Model B (byte-level, 256 vocab):

  • Tokens: [T, h, e, _, c, a, t, …]

  • \(N = 22\) bytes

  • Perplexity = 45.7

Model B appears worse, but it’s predicting at a finer granularity!

15.2 Fair Comparison Metrics

Tokenization-invariant metrics:

  1. Bits per character (BPC): \[\text{BPC} = -\frac{1}{C} \sum_{i=1}^N \log_2 P(t_i \mid t_{<i})\] where \(C\) = number of characters and the sum runs over the \(N\) predicted tokens \(t_i\).

  2. Bits per byte (BPB): Same as BPC but using bytes.

  3. Token-normalized cross-entropy: Use same tokenizer for both models.

  4. Direct task evaluation: MMLU, HELM, HumanEval, RAG benchmarks.
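As a sketch, the illustrative perplexities from the earlier example can be converted to bits per character (all numbers hypothetical, for demonstration only):

```python
import math

def bits_per_char(ppl, num_tokens, num_chars):
    """Convert token-level perplexity into bits per character.

    Total nats = num_tokens * ln(ppl); divide by characters and ln(2) for bits.
    """
    return num_tokens * math.log(ppl) / (num_chars * math.log(2))

# ‘‘The cat sat on the mat’’ has 22 characters
print(bits_per_char(12.3, num_tokens=6, num_chars=22))   # Model A: ≈0.99 bits/char
print(bits_per_char(45.7, num_tokens=22, num_chars=22))  # Model B: ≈5.51 bits/char
```

With these made-up perplexities Model B is still worse after normalization; the point is that only the per-character (or per-byte) numbers are comparable across tokenizers at all.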

Note

Interview Soundbite:

“Perplexity is only meaningful when tokenization is identical. Tokenizer entropy dominates PPL. When tokenization changes, perplexity no longer measures the same distribution.”

16 Tokenization and KV Cache

16.1 KV Cache Basics

During autoregressive decoding, Transformers cache key-value pairs for all previous tokens:

\[\text{Memory}_{\text{KV}} = 2 \cdot L \cdot d \cdot n \cdot b\]

where:

  • \(L\) = number of layers

  • \(d\) = hidden dimension

  • \(n\) = sequence length (number of tokens)

  • \(b\) = bytes per parameter (2 for FP16, 4 for FP32)

  • Factor of 2: keys + values

16.2 Tokenization Impact

Key insight: Tokenization determines \(n\) (sequence length), which linearly affects KV cache size.

Example

KV Cache Calculation:

Model: 40 layers, 8192 hidden dim, FP16 (2 bytes)

Text: 10,000 characters

BPE tokenizer (50k vocab):

  • Sequence length: 2,500 tokens

  • KV memory: \(2 \cdot 40 \cdot 8192 \cdot 2500 \cdot 2 \approx 3.28\) GB

Byte-level tokenizer (256 vocab):

  • Sequence length: 10,000 bytes (1 byte per character)

  • KV memory: \(2 \cdot 40 \cdot 8192 \cdot 10000 \cdot 2 \approx 13.1\) GB

Byte-level uses 4× more memory for KV cache due to longer sequences!
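The calculation above as a one-line helper, following the formula from 16.1:

```python
def kv_cache_bytes(num_layers, hidden_dim, seq_len, bytes_per_value):
    """KV cache size = 2 (keys + values) * L * d * n * b."""
    return 2 * num_layers * hidden_dim * seq_len * bytes_per_value

GB = 1e9
print(kv_cache_bytes(40, 8192, 2_500, 2) / GB)   # ≈3.3 GB (BPE tokens)
print(kv_cache_bytes(40, 8192, 10_000, 2) / GB)  # ≈13.1 GB (raw bytes)
```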

16.3 Consequences for Inference

  1. Latency: More tokens → more decoding steps → higher latency

  2. Memory pressure: Hits GPU memory limits faster

  3. Batch size: Reduces max batch size for batched inference

  4. Sliding window: Hits context window limit sooner

  5. Speculative decoding: Affects branching factor and draft model granularity

Note

Interview Soundbite:

“Tokenization determines effective sequence length, and KV cache scales linearly with token count. Low-vocab tokenizers create longer sequences, increasing cache pressure, latency, GPU memory use, and impacting long-context attention strategies.”

17 Domain-Specific Tokenization

17.1 Multilingual Data

Challenges:

  • Multiple scripts (Latin, CJK, Arabic, Indic, Cyrillic)

  • Whitespace-agnostic languages (Thai, Lao, Khmer)

  • Rare Unicode characters, emojis

Best practices:

  1. SentencePiece Unigram LM: Language-agnostic, no whitespace assumptions

  2. Byte-level BPE: Universal coverage

  3. Shared vocabulary: Train on multilingual corpus (mBERT, XLM-R, mT5)

  4. Explicit space marker: SentencePiece commonly uses U+2581 (LOWER ONE EIGHTH BLOCK) as a leading whitespace marker

  5. Larger vocabulary: 100k–250k to balance compression across languages

Example

Multilingual Segmentation:

English: ‘‘Hello world’’ \(\rightarrow\) [‘‘Hello’’, ‘‘_world’’]

Chinese: ‘‘你好世界’’ (nǐhǎo shìjiè, ‘‘hello world’’) \(\rightarrow\) [‘‘你好’’, ‘‘世界’’]

Thai (no spaces): ‘‘สวัสดีโลก’’ (sawasdee lok, ‘‘hello world’’) \(\rightarrow\) [‘‘สวัสดี’’, ‘‘โลก’’]

SentencePiece handles all three uniformly without language-specific rules.

17.2 Code

Challenges:

  • Long-tail identifiers (DataLoaderConfigOverride, var_x_p10)

  • Operators (==, ->, ::, ...)

  • Whitespace and indentation matter (Python)

  • Mixed natural language (comments) and code syntax

Code-specific tokenizers:

  1. StarCoder, CodeGen, CodeLlama: Train on large code corpora

  2. Preserve whitespace: Use explicit tokens for tabs, newlines

  3. No cross-boundary merges: Don’t merge across /, ., ->

  4. Special tokens: <|endoftext|>, <|file|>, <|repo|>

  5. Language markers: <|python|>, <|java|>, etc.
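As an illustrative (hypothetical) pre-tokenization pattern, a regex can keep operators like -> atomic and prevent merges across . boundaries; real code tokenizers use far more elaborate rules:

```python
import re

# Alternatives are tried left to right: multi-char operators first,
# then identifiers, digits, whitespace, and a single-char catch-all.
PRETOKEN = re.compile(r"->|==|::|\.\.\.|[A-Za-z_]\w*|\d+|\s+|.")

print(PRETOKEN.findall("loader.config -> DataLoaderConfigOverride"))
# ['loader', '.', 'config', ' ', '->', ' ', 'DataLoaderConfigOverride']
```

BPE merges are then learned only within these pre-token boundaries, so ‘‘->’’ stays atomic and identifiers never merge across ‘‘.’’.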

Note

Why not just BPE?

BPE merges frequent character pairs, but code frequencies are:

  • Extremely skewed (long tail of rare identifiers)

  • Not linguistically meaningful (‘‘->’’ should stay atomic)

  • Require syntax sensitivity (don’t merge inside string literals)

17.3 Speech Transcripts

Characteristics:

  • No punctuation

  • Filled pauses (‘‘uh’’, ‘‘erm’’, ‘‘like’’)

  • Non-standard words, disfluencies, false starts

  • Proper nouns, rare entities

Design choices:

  1. Byte or character-level: Handles noisy ASR outputs

  2. Add special tokens: <|startoftranscript|>, <|speaker|>, timestamps

  3. Whisper-style: Task tokens (<|transcribe|>, <|translate|>)

  4. ASR normalization: Lowercase, remove disfluencies

17.4 Compression-Heavy Tasks: Summarization, RAG

For summarization:

  • Summaries compress information → compression interacts with token boundaries

  • Prefer smaller average token length → better semantic granularity

  • Stable numeric representation (dates, percentages)

For RAG (Retrieval-Augmented Generation):

  • Chunking must be tokenizer-aware

  • Embeddings depend on tokenization quality

  • Long tokens → uneven chunk boundaries → retrieval drift

  • Byte-level or Unigram LM yields better recall for embeddings

Example

RAG Chunking Issue:

Document: ‘‘...cardiovascular disease prevalence increased...’’

Bad tokenization:

  • Chunk 1 ends: [‘‘cardio’’]

  • Chunk 2 starts: [‘‘vascular’’, ‘‘disease’’, …]

Result: Query ‘‘cardiovascular disease’’ retrieves Chunk 2 but misses context from Chunk 1.

Solution: Use overlap between chunks or tokenizer-aware splitting.
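A minimal sketch of overlap-based chunking at the token level (the token list stands in for a real tokenizer's output):

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token sequence into chunks of `chunk_size` tokens,
    each sharing `overlap` tokens with the previous chunk."""
    assert 0 <= overlap < chunk_size
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = ["cardio", "vascular", "disease", "prevalence", "increased"]
print(chunk_with_overlap(tokens, chunk_size=3, overlap=1))
# [['cardio', 'vascular', 'disease'], ['disease', 'prevalence', 'increased']]
```

The overlap ensures that a phrase split at a chunk boundary still appears whole in at least one chunk.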

18 Modern Tokenizer Design Principles

18.1 Key Considerations

  1. Vocabulary size: Balance between sequence length and embedding size

  2. Coverage: Handle rare words, typos, emojis, code, multilingual text

  3. Efficiency: Compression ratio (tokens per byte)

  4. Semantic alignment: Tokens should align with meaningful units (morphemes, words)

  5. Stability: Small text changes shouldn’t drastically change tokenization

  6. Deployment: Model size, inference speed, external dependencies

18.2 Interview Cheat Phrases

  • Tokenizer \(\leftrightarrow\) Model Behavior:

    “Tokenizer granularity determines the sequence length distribution, KV cache consumption, training dynamics, and even evaluation metrics like perplexity.”

  • Byte vs. Subword:

    “Bytes give universality and robustness; subwords give compression. Modern architectures try to compensate for byte-level sequence length with efficient attention.”

  • Unigram vs. BPE:

    “BPE greedily merges; Unigram LM evaluates whole-sequence likelihood using Viterbi, which yields more globally optimal segmentations.”

  • Perplexity:

    “Tokenizer entropy dominates PPL. When tokenization changes, perplexity no longer measures the same distribution.”

  • KV Cache:

    “Tokenization determines effective sequence length, and KV cache scales linearly with token count. This directly impacts inference latency and memory.”

19 Advanced Topics

19.1 Adaptive Tokenization

Recent research explores learned tokenization where the model itself determines segmentation.

  • Soft merging: Learnable downsampling instead of hard BPE merges

  • Neural tokenization: Train segmentation jointly with language model

  • Context-dependent: Different segmentation based on surrounding text

19.2 Tokenization-Free Models

Emerging architectures that bypass explicit tokenization:

  • Perceiver AR: Processes raw bytes with cross-attention to latent array

  • MEGABYTE: Hierarchical byte-level model (bytes → patches → global)

  • Character Transformers: Direct character-level prediction with efficient attention

19.3 SentencePiece Implementation

SentencePiece is a widely used tokenizer framework that trains directly on raw text and can implement either BPE or Unigram LM (used in T5/mT5, XLNet, XLM-R, LLaMA, etc.).

Key features:

  • Language-agnostic (no whitespace assumptions)

  • Self-contained (no external dependencies)

  • Supports both BPE and Unigram LM

  • Deterministic encoding/decoding

  • Efficient C++ implementation with Python bindings
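As a sketch, training and applying a Unigram LM tokenizer with the SentencePiece command-line tools might look like the following (corpus.txt and the spm_unigram prefix are placeholder names):

```shell
# Train a Unigram LM tokenizer directly on raw text (no pre-tokenization)
spm_train --input=corpus.txt --model_prefix=spm_unigram \
          --vocab_size=32000 --model_type=unigram \
          --character_coverage=0.9995

# Segment text with the trained model
echo "Hello world" | spm_encode --model=spm_unigram.model
```

Setting --model_type=bpe instead trains a BPE model through the same interface.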

Note

Why SentencePiece?

  • Treats text as raw Unicode stream (uses a dedicated whitespace marker, commonly U+2581)

  • No language-specific preprocessing (Chinese segmentation, etc.)

  • Same code path for all languages

  • Used by Google, Meta, and many open-source LLM projects

20 Summary: Decision Framework

20.1 When to Use What

Use Case                      Recommended Tokenizer
General-purpose LLM           BPE (GPT), Unigram LM (T5), vocab \(\sim\)32k–100k
Multilingual                  SentencePiece Unigram LM, vocab \(\sim\)100k–250k
Code generation               Code-specific BPE (CodeLlama), preserve operators
Ultra-long context            Byte-level + efficient attention (RWKV, Charformer)
Edge deployment               Low vocab (\(\sim\)5k), small embeddings
High robustness (noisy text)  Byte-level or character-level
RAG embeddings                Unigram LM or byte-level (stable chunking)
Speech transcripts            Character or byte-level + special tokens

20.2 Core Tradeoffs

Factor           High Vocab  Low Vocab
Sequence length  ↓ Short     ↑ Long
Model size       ↑ Large     ↓ Small
Attention cost   ↓ Low       ↑ High
Generalization   ↓ Brittle   ↑ Robust
Compression      ↑ Good      ↓ Poor