15 Chapter 14: LLM Fine-Tuning
16 Introduction
Fine-tuning adapts pre-trained LLMs to specific tasks or domains. Key challenges:
Memory: Full fine-tuning requires storing optimizer states (2-4× model size for Adam)
Compute: Backward pass through all layers is expensive
Catastrophic Forgetting: Aggressive updates degrade general knowledge
Multi-Task Serving: Switching between task-specific checkpoints is slow
Solution Hierarchy:
Full Fine-Tuning: Update all parameters (baseline, most expensive)
Parameter-Efficient Fine-Tuning (PEFT): Update small subset or low-rank adapters
LoRA: Low-rank adaptation; injects trainable rank-decomposition matrices
QLoRA: LoRA + quantized base model (4-bit) for extreme memory efficiency
This Chapter Covers:
Full fine-tuning foundations and when to use it
LoRA mechanics, rank selection, layer targeting
QLoRA: 4-bit quantization + LoRA for consumer GPUs
Adapter architectures: serial, parallel, fusion strategies
Compute optimizations: merged vs dynamic adapters
Synthetic data generation for fine-tuning
17 Full Fine-Tuning
17.1 When to Use Full Fine-Tuning
Use Cases:
Domain shift: Medical, legal, code (vocabulary/distribution far from pre-training)
Small models: \(<\)3B parameters where memory is manageable
Maximum performance: Task requires full model capacity (e.g., complex reasoning)
Sufficient data: 10K+ high-quality examples (low risk of overfitting)
Avoid When:
Limited compute (1-2 GPUs, \(<\)40GB VRAM each)
Small dataset (\(<\)1K examples): high overfitting risk
Multi-task serving (can’t afford multiple full checkpoints)
17.2 Training Recipe
Hyperparameters:
Learning rate: \(5 \times 10^{-6}\) to \(5 \times 10^{-5}\) (10-100× lower than pre-training)
Batch size: 8-32 per GPU with gradient accumulation
Epochs: 1-3 (monitor validation perplexity closely)
Optimizer: AdamW with \(\beta_1=0.9\), \(\beta_2=0.95\), weight decay \(0.1\)
Warmup: 3-10% of steps
Scheduler: Cosine decay to \(10\%\) of peak LR
Layer-Specific Learning Rates:
Embeddings: \(1 \times 10^{-6}\) (preserve token representations)
Early layers: \(2 \times 10^{-6}\) (low-level features stable)
Middle layers: \(1 \times 10^{-5}\) (task-specific features)
Final layers: \(5 \times 10^{-5}\) (task head, most adaptation)
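The warmup-plus-cosine schedule in the recipe above can be sketched in a few lines (a minimal illustration; the function name and defaults are mine, not from any library):

```python
import math

def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_frac=0.05, floor_frac=0.1):
    """Linear warmup to peak_lr, then cosine decay to floor_frac * peak_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear warmup from ~0 up to the peak learning rate
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to floor_frac * peak_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = floor_frac * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))
```

With 5% warmup over 100 steps, the rate climbs linearly for 5 steps, peaks, and decays to 10% of the peak by the end, matching the recipe.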
17.3 Memory Requirements
For a model with \(P\) parameters in FP16/BF16:
Model weights: \(2P\) bytes
Gradients: \(2P\) bytes
Optimizer states (Adam): \(8P\) bytes (FP32 first/second moments)
Activations: Depends on batch size and sequence length (often \(>10P\) for LLMs)
Total: \(\sim 12P + \text{activations}\)
Example: Llama-2-7B (7B parameters \(\times\) 2 bytes = 14GB) requires: \[14\text{GB (weights)} + 14\text{GB (grads)} + 56\text{GB (Adam)} + 30\text{GB (activations)} \approx 114\text{GB}\]
Requires A100 80GB or multi-GPU with model parallelism.
Gradient Checkpointing: Trades compute for memory by recomputing activations during backward pass instead of storing them. Reduces activation memory by \(\sim\)3-5× but increases training time by \(\sim\)20-30%. Essential for large models on limited VRAM.
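The memory accounting above can be wrapped in a quick estimator (a rough sketch; the function name and defaults are mine, and the activation term is workload-dependent, so it is passed in as an assumption):

```python
def full_ft_memory_gb(params_b, bytes_per_weight=2, adam_bytes=8,
                      activation_gb=30.0):
    """Rough full fine-tuning memory estimate in GB for a model with
    params_b billion parameters: weights + gradients in 16-bit,
    Adam first/second moments in FP32 (8 bytes/param)."""
    weights = bytes_per_weight * params_b   # 1B params * 1 byte ~ 1 GB
    grads = bytes_per_weight * params_b
    optim = adam_bytes * params_b
    return weights + grads + optim + activation_gb
```

For Llama-2-7B this reproduces the 114GB figure above: 14 + 14 + 56 + 30.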
18 Low-Rank Adaptation (LoRA)
18.1 Core Idea
LoRA hypothesis: Fine-tuning updates have low intrinsic rank.
Instead of updating weight matrix \(W \in \mathbb{R}^{d \times k}\), inject low-rank decomposition: \[W' = W + \Delta W = W + BA\] where:
\(W\): Frozen pre-trained weights (\(d \times k\))
\(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\): Trainable low-rank matrices
\(r \ll \min(d, k)\): Rank (typically \(r = 4\)-\(64\))
Forward Pass: \[h = W x + \frac{\alpha}{r} BA x\] where \(\alpha\) is a scaling factor (typically \(\alpha = r\) so scaling is \(1\)).
Parameter Reduction: Original has \(dk\) parameters; LoRA adds \(r(d+k)\).
For \(d=k=4096\) and \(r=8\): \[\text{Original: } 4096^2 \approx 16.8\text{M params} \qquad \text{LoRA: } 8 \times 8192 \approx 65.5\text{K params} \qquad (256\times \text{ reduction})\]
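Checking the arithmetic in code:

```python
d = k = 4096
r = 8

full_params = d * k            # dense update: 16,777,216 (~16.8M)
lora_params = r * (d + k)      # B (d x r) plus A (r x k): 65,536 (~65.5K)
reduction = full_params // lora_params  # 256
```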
18.2 Initialization
Standard LoRA:
\(A \sim \mathcal{N}(0, \sigma^2)\) where \(\sigma = 1/\sqrt{r}\) (Kaiming-like)
\(B = 0\) (ensures \(\Delta W = 0\) at initialization, so the model starts identical to the base)
Scaling \(\Delta W = \frac{\alpha}{r} BA\) with \(\alpha \approx r\) keeps update magnitude stable
Why this works: Setting \(B=0\) preserves the pretrained model at step 0. Gradients first move \(B\) off zero, then \(A\) starts learning. Initializing both \(A\) and \(B\) to zero would block learning (zero gradients for both).
Alternative (Warm Start):
Initialize \(BA\) via SVD of small full fine-tuning update: \(\Delta W_{\text{warmup}} \approx U \Sigma V^T\), then \(B = U_{:r} \Sigma_{:r}^{1/2}\), \(A = \Sigma_{:r}^{1/2} V_{:r}^T\)
This SVD-informed initialization starts LoRA near a known good update direction, improving early convergence
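A toy LoRA layer in plain Python makes the zero-initialization property concrete (illustrative only; the class and helper names are mine, and real implementations use GPU tensors):

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

class LoRALinear:
    """Toy LoRA layer: y = Wx + (alpha/r) B(Ax).
    W is frozen; A ~ Gaussian with sigma = 1/sqrt(r); B = 0,
    so the layer matches the base model exactly at initialization."""
    def __init__(self, W, r, alpha=None, seed=0):
        rng = random.Random(seed)
        d, k = len(W), len(W[0])
        self.W, self.r = W, r
        self.alpha = alpha if alpha is not None else r  # alpha = r -> scaling 1
        sigma = 1.0 / r ** 0.5
        self.A = [[rng.gauss(0, sigma) for _ in range(k)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d)]  # zeros: Delta W = 0 at step 0

    def forward(self, x):
        base = matvec(self.W, x)
        bax = matvec(self.B, matvec(self.A, x))
        s = self.alpha / self.r
        return [b + s * v for b, v in zip(base, bax)]
```

At initialization the adapter contributes nothing, so `forward` returns exactly `Wx`; only once gradients move \(B\) off zero does the update take effect.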
18.3 Which Layers to Apply LoRA?
Typical Choices (Transformers):
Query/Value (Q, V) only: Default in original paper, works well for most tasks
All attention (Q, K, V, O): Better for complex tasks, 2× more params
Attention + FFN: Maximum capacity, 3-4× more params than Q/V only
Freeze embeddings: Almost always; embeddings are largely task-agnostic
Rule of Thumb:
Small dataset (\(<\)1K): Q+V only, low rank (\(r=4\)-\(8\))
Medium dataset (1K-10K): Q+K+V+O, medium rank (\(r=16\)-\(32\))
Large dataset (\(>\)10K): Attention + FFN, high rank (\(r=32\)-\(64\))
18.4 Rank Selection
Trade-offs:
Low rank (\(r=4\)-\(8\)): Minimal memory, fast, risk of underfitting
Medium rank (\(r=16\)-\(32\)): Good balance, most common
High rank (\(r=64\)-\(128\)): Approaches full fine-tuning, higher overfitting risk
Empirical Observations:
Task difficulty matters more than dataset size
Complex reasoning (math, code) benefits from \(r=32\)-\(64\)
Simple classification often saturates at \(r=8\)-\(16\)
Ablation study: Train \(r=8, 16, 32\) and pick best validation performance
Adaptive Rank: Some frameworks (e.g., AdaLoRA) dynamically adjust rank per layer during training by pruning low-importance singular values. Saves memory while maintaining performance.
18.5 Training Recipe
Hyperparameters:
Learning rate: \(1 \times 10^{-4}\) to \(5 \times 10^{-4}\) (10× higher than full fine-tuning)
Batch size: 16-64 (can use larger since memory footprint is small)
Epochs: 3-10 (LoRA trains faster than full fine-tuning)
Optimizer: AdamW (only store states for LoRA params, not base model)
Warmup: 5-10% of steps
Scheduler: Linear or cosine decay
Memory Savings:
Base model \(W\): FP16, frozen (no gradients or optimizer states)
LoRA matrices \(A, B\): FP32 or BF16 with optimizer states
Total memory: \(2P_{\text{base}} + 12P_{\text{LoRA}}\)
For Llama-2-7B with \(r=16\) on Q/V (adds \(\sim\)20M trainable params): \[14\text{GB (base)} + 0.24\text{GB (LoRA states)} + 20\text{GB (activations)} \approx 35\text{GB}\] Fits on single A100 40GB!
19 Quantized LoRA (QLoRA)
19.1 Core Idea
QLoRA = Quantize base model to 4-bit + LoRA adapters in high precision.
Motivation: The base model weights \(W\) are frozen, so they can be quantized aggressively. The adapter matrices \(B, A\) are trained in FP16/BF16 for stability.
Key Difference from LoRA:
LoRA: Base model in FP16/BF16 (14GB for 7B model)
QLoRA: Base model in NF4 (Normal Float 4-bit) + double quantization (3.5GB for 7B model)
19.2 NF4 Quantization
Normal Float 4-bit (NF4): Quantization levels chosen to match Gaussian distribution \(\mathcal{N}(0, 1)\).
Why? Pre-trained weights approximately \(\mathcal{N}(0, \sigma^2)\). Standard uniform INT4 wastes bins in tail; NF4 concentrates bins near zero.
Key Property: 16 quantization levels positioned such that each bin has equal probability under \(\mathcal{N}(0,1)\) (information-theoretically optimal for Gaussian data).
Double Quantization: Quantize the scale factors themselves to INT8 (saves \(\sim\)0.5GB for 7B model).
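The equal-probability idea can be illustrated with the standard library (a sketch only; the actual NF4 codebook is constructed somewhat differently and pins an exact zero level):

```python
from statistics import NormalDist

def gaussian_levels(bits=4):
    """Place 2**bits quantization levels at the midpoint quantiles of
    N(0, 1), so each bin carries equal probability mass, then normalize
    to [-1, 1]. Illustrates the NF4 design principle."""
    n = 2 ** bits
    nd = NormalDist()
    levels = [nd.inv_cdf((2 * i + 1) / (2 * n)) for i in range(n)]
    m = max(abs(l) for l in levels)
    return [l / m for l in levels]
```

Note how the 16 levels cluster near zero and spread out in the tails, matching where Gaussian weight mass actually lies, unlike uniform INT4.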
19.3 QLoRA vs LoRA
| Method | Base Model | Memory (7B) | Performance |
|---|---|---|---|
| Full Fine-Tuning | FP16 | 114GB | Baseline |
| LoRA | FP16 | 35GB | \(-0.5\%\) to \(-1\%\) |
| QLoRA | NF4 | 12GB | \(-1\%\) to \(-2\%\) |
When to Use QLoRA:
Consumer GPUs (RTX 4090 24GB, RTX 3090 24GB)
Training 13B-33B models on single GPU
Extreme memory constraint (e.g., fine-tuning on laptop with 16GB VRAM)
Trade-offs:
Pros: 3-4× memory reduction vs LoRA, enables larger models
Cons: Slower training (\(\sim\)30% due to dequantization overhead), slight quality drop (\(\sim\)1-2%)
19.4 Training Recipe
Same as LoRA except:
Slightly higher rank (\(r=32\)-\(64\)) to compensate for base model quantization
May need lower learning rate (\(5 \times 10^{-5}\)) for stability
More epochs (5-15) to converge due to quantization noise
Paged Optimizers: QLoRA uses paged optimizers, which offload optimizer states to CPU RAM when GPU memory runs short. Essential for fitting 33B+ models on a 24GB GPU.
20 Adapter Architectures
Beyond LoRA’s parallel low-rank injection, several adapter patterns exist.
20.1 Serial Adapters (Houlsby-style)
Architecture: Insert adapter modules after each transformer sub-layer.
\(h_1 = \text{Attention}(x) + x\) \(h_2 = \text{Adapter}(h_1) + h_1\) \(h_3 = \text{FFN}(h_2) + h_2\) \(h_4 = \text{Adapter}(h_3) + h_3\)
Adapter Module: \[\text{Adapter}(h) = W_{\text{up}} \cdot \sigma(W_{\text{down}} \cdot h)\] where \(W_{\text{down}} \in \mathbb{R}^{d \times r}\), \(W_{\text{up}} \in \mathbb{R}^{r \times d}\), \(r \ll d\) (bottleneck).
Adapters vs LoRA: Both are Low-Rank Bottlenecks
Mapping:
| Component | Adapter | LoRA |
|---|---|---|
| Down-projection (\(d \to r\)) | \(W_{\text{down}}\) | \(A\) |
| Up-projection (\(r \to d\)) | \(W_{\text{up}}\) | \(B\) |
| Non-linearity | Yes (\(\sigma\)) | No |
| Placement | Serial (after layer) | Parallel (with layer) |
| Merge at inference? | No (sequential) | Yes (\(W' = W + BA\)) |
Both use \(2 \times d \times r\) trainable parameters per module. LoRA’s lack of non-linearity is offset by ability to merge weights, eliminating inference overhead.
Pros/Cons:
Pros: More expressive (non-linear via \(\sigma\)), good for multi-task learning
Cons: Adds latency (sequential), 2× more adapter modules than LoRA
20.2 Parallel Adapters (LoRA-style)
Architecture: Add adapter output in parallel with original layer.
\(h = W x + \text{Adapter}(x) + x\)
For LoRA: \(\text{Adapter}(x) = \frac{\alpha}{r} BA x\)
Pros/Cons:
Pros: No latency increase (fused computation), simpler training
Cons: Linear only (no non-linearity), slightly less expressive
20.3 Adapter Fusion
Problem: After training task-specific adapters, how to combine them for multi-task inference?
Naive: Switch adapters per task (requires model reload).
Fusion: Learn weighted combination of adapters with small fusion layer: \[h = Wx + \sum_{i=1}^{N} \alpha_i \cdot \text{Adapter}_i(x)\] where \(\alpha_i\) learned via attention over task embeddings.
Use Case: Multi-task serving where model handles multiple domains (e.g., chatbot with medical/legal/general knowledge).
21 Compute Optimizations
21.1 Merged vs Dynamic Adapters
Merged Adapters (Deployment):
After training, compute \(W' = W + BA\) and save single checkpoint
Pros: No inference overhead, same speed as base model
Cons: Cannot switch tasks dynamically, requires separate checkpoint per task
Implementation:
# Merge LoRA into base model
W_merged = W_base + (lora_B @ lora_A) * (alpha / rank)
Dynamic Adapters (Multi-Task Serving):
Keep base model \(W\) frozen, compute \(Wx + BAx\) at runtime
Pros: Switch adapters per request (load \(B, A\) from disk), single base model for all tasks
Cons: \(\sim\)10-20% latency overhead for separate matrix multiplies
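On tiny matrices the two paths are easy to verify as numerically equivalent (the values below are arbitrary illustrations):

```python
def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

# Tiny illustrative shapes: d = k = 3, r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]   # d x r
A = [[0.5, 1.0, 0.0]]       # r x k
alpha, r = 2.0, 1
s = alpha / r
x = [1.0, 2.0, 3.0]

# Merged path: fold the adapter into the weights once (deployment).
BA = matmul(B, A)
W_merged = [[W[i][j] + s * BA[i][j] for j in range(3)] for i in range(3)]
y_merged = matvec(W_merged, x)

# Dynamic path: keep W frozen, apply the adapter at runtime (multi-task).
y_dynamic = [base + s * extra for base, extra in
             zip(matvec(W, x), matvec(B, matvec(A, x)))]

assert all(abs(a - b) < 1e-9 for a, b in zip(y_merged, y_dynamic))
```

The outputs are identical; the trade-off is purely operational (speed vs the ability to swap adapters per request).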
21.2 Batched Multi-Adapter Inference
Problem: Batch contains requests for different tasks (different adapters).
Naive: Run each request separately (no batching benefit).
Optimized (S-LoRA): Compute base model forward pass once, then apply task-specific adapters:
\(H_{\text{base}} = W X\) \(H_t = H_{\text{base}} + B_t A_t X_t\)
Memory Management: Keep hot adapters in GPU memory, swap cold adapters to CPU/disk.
Production Example: vLLM + LoRAX supports batched multi-adapter inference with \(<\)5% throughput degradation vs single-task.
21.3 Fused Kernels for LoRA
Standard Approach:
Compute \(y_1 = Wx\) (base GEMM)
Compute \(y_2 = Ax\) (adapter GEMM)
Compute \(y_3 = By_2\) (adapter GEMM)
Add: \(y = y_1 + \frac{\alpha}{r} y_3\)
Total: 3 kernel launches, poor memory locality.
Fused Kernel:
Single kernel computes \(y = Wx + \frac{\alpha}{r} B(Ax)\)
Speedup: 20-40% faster than naive implementation
Available in: PEFT library, vLLM, TensorRT-LLM
21.4 Training Speedups
Gradient Checkpointing: Even with LoRA, activations dominate memory. Checkpointing reduces memory by 3-5×.
Mixed Precision:
Base model: FP16/BF16 (or NF4 for QLoRA)
Adapters: FP32 (higher precision for stability)
Gradients: FP16 (reduce memory)
DeepSpeed ZeRO: Shard optimizer states across GPUs (Stage 2) or all parameters (Stage 3). Enables LoRA training of 70B+ models on 8× A100.
22 Synthetic Data for Fine-Tuning
22.1 When to Use Synthetic Data
Scenarios:
Limited labeled data (\(<\)100 examples)
Domain-specific task with no public dataset (e.g., company-internal QA)
Data augmentation for low-resource tasks
Bootstrapping for instruction tuning
22.2 Generation Methods
22.2.1 Teacher Model Sampling
Same as distillation (nucleus sampling, \(p=0.9\)):
Curate seed prompts (100-1K examples covering task distribution)
Generate completions from larger teacher model
Filter by quality: perplexity \(<\) threshold, length in range, no toxicity
Fine-tune student on synthetic (prompt, completion) pairs
Production Example: Alpaca (52K instruction-following examples generated from GPT-3.5-Turbo + 175 seed prompts).
22.2.2 Self-Instruct
Idea: Use model to generate its own training data iteratively.
Initialize with small seed set (e.g., 50 examples)
Sample \(n\) seed examples from current dataset
Prompt model to generate new instructions
Generate outputs for new instructions
Filter for quality and diversity
Add to dataset and fine-tune
Challenges:
Quality degrades over iterations (model amplifies its own mistakes)
Requires strong base model (GPT-3.5+ level)
Need diversity filters (avoid redundant examples)
22.2.3 Evol-Instruct
Idea: Iteratively increase complexity of instructions.
Complexity Operations:
Add constraints (e.g., "in 100 words", "without using the letter ‘e’")
Increase reasoning steps (multi-hop questions)
Add domain knowledge requirements
Combine multiple skills (summarize + translate)
Example:
Seed: "Summarize this article."
Evolved: "Summarize this medical research article in layman’s terms, focusing on clinical implications, in under 150 words."
Used in WizardLM, WizardCoder.
22.3 Quality Control
Filters:
Perplexity: Reject examples with \(\text{PPL} > 100\) (likely gibberish)
Length: Filter too short (\(<\)20 tokens) or too long (\(>\)2K tokens)
Diversity: Use embedding clustering, discard near-duplicates
Toxicity: Run Perspective API or toxicity classifier
Task adherence: Prompt-based validation (does output follow instruction?)
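The automatic filters above might be combined into a single gate per example (a sketch; the field names and thresholds are illustrative, and they assume scores were precomputed by upstream tools):

```python
def keep_example(ex, max_ppl=100.0, min_tokens=20, max_tokens=2000):
    """Return True if a synthetic example passes the quality gates.
    Expects a dict with precomputed 'ppl' and 'tokens' fields and an
    optional 'toxicity' score in [0, 1]."""
    if ex["ppl"] > max_ppl:            # likely gibberish
        return False
    if not (min_tokens <= ex["tokens"] <= max_tokens):  # too short/long
        return False
    if ex.get("toxicity", 0.0) > 0.5:  # toxicity classifier output
        return False
    return True
```

Diversity filtering (embedding clustering, near-duplicate removal) operates on the dataset as a whole, so it would run as a separate pass after this per-example gate.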
Human-in-the-Loop:
Sample 5-10% of synthetic data for manual review
Identify systematic errors (e.g., model always refuses certain instructions)
Refine generation prompts based on failure modes
23 End-to-End: Fine-Tuning Qwen-Coder for Custom Repository
23.1 Use Case & Goals
Scenario: You have a proprietary codebase (e.g., internal Python framework with custom APIs, naming conventions, architectural patterns) and want to adapt Qwen2.5-Coder-7B to generate code following your conventions.
Goals:
Generate code using custom APIs (not seen during pretraining)
Follow internal naming conventions and style guides
Handle repository-specific patterns (e.g., config managers, logging utilities)
Maintain general coding ability without catastrophic forgetting
23.2 Step 1: Tokenization Analysis
Check Vocabulary Coverage:
Qwen-Coder uses a large vocabulary (\(\sim\)152k tokens) trained on code corpora. However, your custom APIs may not be well-represented.
Extract custom identifiers: Collect function names, class names, variables from your codebase:
# identifiers.py
from transformers import AutoTokenizer
import re

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")

# Extract identifiers from code
with open("your_repo/core/api.py") as f:
    code = f.read()
identifiers = re.findall(r'\b[A-Za-z_][A-Za-z0-9_]*\b', code)

# Check tokenization
for name in set(identifiers):
    tokens = tokenizer.tokenize(name)
    if len(tokens) > 1:
        print(f"{name} -> {tokens}  # Fragmented!")

Decision: If \(>\)20% of identifiers fragment into 3+ tokens, consider adding custom tokens to the vocabulary. Otherwise, proceed without modification (Qwen-Coder handles long identifiers reasonably).
Vocabulary extension (optional):
# Add custom tokens (use sparingly!)
custom_tokens = ["CustomAPIClient", "InternalConfig", ...]
tokenizer.add_tokens(custom_tokens)
model.resize_token_embeddings(len(tokenizer))

Warning: New token embeddings are randomly initialized; they require more training data to learn.
23.3 Step 2: Data Preparation
Dataset Construction:
Extract repository snippets:
Parse Python files, extract functions/classes with docstrings
Create pairs: (docstring \(\rightarrow\) implementation)
Filter: remove trivial functions (\(<\)5 lines), keep high-quality comments
Synthetic pair generation: Use GPT-4 or Claude to generate instruction-code pairs:
# Prompt template
You are documenting a Python codebase. Given this function:
```python
{function_code}
```
Generate:
1. A natural language instruction requesting this function
2. A docstring explaining its purpose
3. Example usage

Format for instruction tuning:
# dataset.jsonl (ChatML format for Qwen)
{"messages": [
  {"role": "system", "content": "You are a code assistant..."},
  {"role": "user", "content": "Write a function to load config..."},
  {"role": "assistant", "content": "```python\n{code}\n```"}
]}

Mix with general code data: Include 20-30% open-source examples (HumanEval, MBPP) to prevent forgetting.
Target Dataset Size:
Minimum: 500 high-quality pairs
Recommended: 2,000-5,000 pairs (mix of real + synthetic)
Maximum: 10,000+ if available (diminishing returns)
23.4 Step 3: LoRA Configuration
Why LoRA for Code Models:
Qwen2.5-Coder-7B has 7B params → full fine-tuning needs 114GB GPU memory
LoRA reduces to 12-24GB (fits on single A10/A100 40GB)
Preserves base model’s general coding ability
Enables multi-repository adapters (train separate LoRAs per codebase)
Recommended Hyperparameters:
| Parameter | Value | Rationale |
|---|---|---|
| Rank \(r\) | 32 | Code generation needs higher capacity than classification |
| Alpha \(\alpha\) | 64 | \(\alpha = 2r\) for stable scaling |
| Target modules | Q+K+V+O+FFN | Code requires reasoning (attention + feedforward) |
| Dropout | 0.05 | Light regularization |
| Learning rate | \(3 \times 10^{-4}\) | Standard for LoRA |
| Batch size | 4-8 | Per-device, use gradient accumulation |
| Gradient accum steps | 4 | Effective batch = 16-32 |
| Epochs | 3-5 | Monitor validation, stop early |
| Max seq length | 2048 | Code context window |
23.5 Step 4: Training with PEFT + DeepSpeed
Setup (HuggingFace PEFT):
# train.py
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
import torch
# Load model in 4-bit for QLoRA (optional)
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-Coder-7B",
load_in_4bit=True, # Use QLoRA
torch_dtype=torch.bfloat16,
device_map="auto"
)
model = prepare_model_for_kbit_training(model)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")
tokenizer.pad_token = tokenizer.eos_token
# LoRA config
lora_config = LoraConfig(
r=32,
lora_alpha=64,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints trainable vs total parameter counts
# Training arguments
training_args = TrainingArguments(
output_dir="./qwen-coder-custom-lora",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=3e-4,
fp16=False,
bf16=True,
logging_steps=10,
save_strategy="epoch",
evaluation_strategy="epoch",
warmup_steps=50,
lr_scheduler_type="cosine",
max_grad_norm=1.0
)
# SFTTrainer for instruction tuning
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
max_seq_length=2048,
packing=False # Keep False for code (maintain context boundaries)
)
trainer.train()
trainer.save_model()
Expected Training Time:
Hardware: Single A100 40GB
Dataset: 5,000 examples, max length 2048
Time: \(\sim\)6-8 hours for 3 epochs
Memory: \(\sim\)24GB (QLoRA) or 40GB (16-bit LoRA)
23.6 Step 5: Evaluation
Metrics:
Pass@k on custom test set:
Create 50-100 held-out instructions from your codebase
Generate \(k=10\) completions per instruction
Execute and verify correctness (unit tests)
Measure: fraction that pass tests
HumanEval retention: Run HumanEval benchmark to ensure no catastrophic forgetting of general coding.
Style adherence: Manual review; does the generated code follow your conventions?
Naming (snake_case vs camelCase)
Import patterns (from mylib import X vs import mylib.X)
Error handling (custom exceptions)
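The pass@k metric from Step 5 has a standard unbiased combinatorial estimator, given \(n\) generated samples per instruction of which \(c\) pass the tests:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn (without replacement) from n generations is correct,
    given that c of the n are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over the 50-100 held-out instructions gives the final score; generating \(n > k\) samples per instruction reduces estimator variance.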
Inference Example:
# inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-Coder-7B",
torch_dtype=torch.bfloat16,
device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./qwen-coder-custom-lora")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")
# Generate
messages = [
{"role": "system", "content": "You are an expert in our codebase."},
{"role": "user", "content": "Write a function to load config from YAML"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False,
add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
23.7 Step 6: Deployment Strategies
Option 1: Merge LoRA into base model (single-task):
from peft import PeftModel
model = PeftModel.from_pretrained(base_model, "./qwen-coder-custom-lora")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./qwen-coder-merged")
Pros: Faster inference (no adapter overhead)
Cons: Cannot switch between repositories
Option 2: Multi-adapter serving (multi-task):
Keep base model in memory
Load LoRA adapters dynamically per request (LoRAX, vLLM)
Each repository gets its own adapter
Latency: \(<\)50ms to swap adapters
Option 3: Quantized deployment (edge):
# Quantize merged model to INT8/INT4
from optimum.quanto import quantize, qint4
quantize(merged_model, weights=qint4, activations=None)
merged_model.save_pretrained("./qwen-coder-int4")
23.8 Production Considerations
Data pipeline: Automate scraping new code weekly, retrain LoRA monthly
Version control: Tag adapters with repo commits (lora-v1.2.3)
A/B testing: Serve base model vs LoRA to measure quality improvement
Monitoring: Log generated code → manual review → feedback loop
Compliance: Ensure training data doesn’t leak proprietary secrets (filter credentials, keys)
Multi-repo scaling: Train separate adapters per microservice/team
24 Best Practices & Common Pitfalls
24.1 Best Practices
General:
Start small: Try LoRA with \(r=8\) on Q+V before scaling up
Monitor validation: Stop early if validation loss plateaus or increases
Save checkpoints: Save every epoch; the best checkpoint is often not the last
Ablate hyperparameters: Test \(r \in \{8, 16, 32\}\) and LR \(\in \{10^{-4}, 3 \times 10^{-4}, 10^{-3}\}\)
Dataset-Specific:
Small data (\(<\)1K): Low rank (\(r=4\)-\(8\)), more regularization (dropout \(0.1\)), more epochs (10-20)
Medium data (1K-10K): Standard recipe (\(r=16\), LR \(3 \times 10^{-4}\), 5-10 epochs)
Large data (\(>\)10K): Higher rank (\(r=32\)-\(64\)), consider full fine-tuning if compute allows
Task-Specific:
Classification/NER: Q+V sufficient, low rank
Generation (summarization, translation): Q+K+V+O, medium rank
Complex reasoning (math, code): Attention + FFN, high rank (\(r=32\)-\(64\))
24.2 Common Pitfalls
Rank too high: Overfits on small data, wasted memory
Learning rate too high: Catastrophic forgetting, loss spikes
Training too long: Overfitting, validation loss increases after epoch 3-5
Ignoring data quality: Synthetic data with errors propagates to model
Not freezing embeddings: Wastes memory, destabilizes rare tokens
Merging adapters prematurely: Test dynamic inference first (may need multi-task)
24.3 Production Recipes
Llama-2-7B Fine-Tuning (LoRA):
Layers: Q+V, \(r=16\), \(\alpha=16\)
LR: \(3 \times 10^{-4}\), warmup 3%, cosine decay
Batch size: 32 (gradient accumulation 4), epochs: 5
Memory: 35GB (A100 40GB)
Llama-2-13B Fine-Tuning (QLoRA):
Base: NF4 + double quantization, Adapters: Q+K+V+O, \(r=32\)
LR: \(1 \times 10^{-4}\), warmup 5%, linear decay
Batch size: 16, epochs: 10
Memory: 18GB (RTX 4090 24GB)
Multi-Task Adapter Serving:
Base model in GPU memory (14GB for 7B)
Hot adapters in GPU (top 10, \(\sim\)100MB each)
Cold adapters on CPU/disk (swap on demand, \(<\)50ms latency)
Use vLLM + LoRAX for batched inference
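The hot/cold adapter policy above sketches naturally as an LRU cache (illustrative only; `load_fn` stands in for reading adapter weights from CPU RAM or disk):

```python
from collections import OrderedDict

class AdapterCache:
    """LRU cache for hot adapters: the top-N adapters stay resident,
    the coldest is evicted when a new one is requested."""
    def __init__(self, load_fn, capacity=10):
        self.load_fn = load_fn
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, adapter_id):
        if adapter_id in self._cache:
            self._cache.move_to_end(adapter_id)   # mark as hot
        else:
            if len(self._cache) >= self.capacity:
                self._cache.popitem(last=False)   # evict the coldest
            self._cache[adapter_id] = self.load_fn(adapter_id)  # cold load
        return self._cache[adapter_id]
```

In production the cold-load path is what incurs the \(<\)50ms swap latency; cache hits cost nothing beyond the adapter matmuls.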
25 Interview Questions
Q: Why does LoRA work?
A: Fine-tuning updates have low intrinsic rank; most of the information sits in the top few singular values. Full-rank updates are redundant.
Q: LoRA vs QLoRA?
A: LoRA uses FP16 base (35GB for 7B), QLoRA uses NF4 base (12GB). QLoRA adds \(\sim\)1-2% quality loss but enables larger models on consumer GPUs.
Q: When to use full fine-tuning vs LoRA?
A: Full FT for domain shift + large data + sufficient compute. LoRA for limited compute, small data, multi-task serving.
Q: How to choose rank \(r\)?
A: Start with \(r=16\). Ablate \(\{8, 16, 32\}\) based on validation. Complex tasks (code, math) benefit from \(r=32\)-\(64\). Small data risks overfitting with high \(r\).
Q: Which layers for LoRA?
A: Q+V default, Q+K+V+O for better quality, attention+FFN for max capacity. Always freeze embeddings.
Q: Merged vs dynamic adapters?
A: Merge for single-task deployment (no overhead). Keep dynamic for multi-task serving (10-20% latency cost but flexible).
Q: How to generate synthetic data?
A: Nucleus sampling (\(p=0.9\)) from teacher model. Filter by perplexity, length, toxicity. Human review 5-10%. Examples: Alpaca (52K from GPT-3.5), WizardLM (Evol-Instruct).
26 Summary
Fine-Tuning Hierarchy:
Full Fine-Tuning: Update all params, highest quality, 114GB for 7B model
LoRA: Low-rank adapters, 35GB for 7B, 256× param reduction, \(<\)1% quality loss
QLoRA: 4-bit base + LoRA, 12GB for 7B, 1-2% quality loss, consumer GPU friendly
Key Hyperparameters:
Rank: \(r=16\) default, ablate \(\{8, 16, 32\}\)
Layers: Q+V (default), Q+K+V+O (better), attention+FFN (max)
LR: \(3 \times 10^{-4}\) for LoRA, \(5 \times 10^{-5}\) for full FT
Epochs: 5-10 for LoRA, 1-3 for full FT
Deployment Strategies:
Single-task: Merge adapters into base model (no overhead)
Multi-task: Dynamic adapters with batched inference (S-LoRA, vLLM)
Extreme memory: QLoRA + gradient checkpointing + DeepSpeed ZeRO
Synthetic Data:
Teacher sampling with nucleus (\(p=0.9\)) for diversity
Filter by perplexity, length, toxicity, diversity
Self-Instruct / Evol-Instruct for bootstrapping
Mix 70-90% synthetic with 10-30% real data