15 Chapter 14: LLM Fine-Tuning
16 Introduction
Fine-tuning adapts pre-trained LLMs to specific tasks or domains. Key challenges:
Memory: Full fine-tuning requires storing optimizer states (2-4× model size for Adam)
Compute: Backward pass through all layers is expensive
Catastrophic Forgetting: Aggressive updates degrade general knowledge
Multi-Task Serving: Switching between task-specific checkpoints is slow
Solution Hierarchy:
Full Fine-Tuning: Update all parameters (baseline, most expensive)
Parameter-Efficient Fine-Tuning (PEFT): Update small subset or low-rank adapters
LoRA: Low-rank adaptation; injects trainable rank-decomposition matrices
QLoRA: LoRA + quantized base model (4-bit) for extreme memory efficiency
This Chapter Covers:
Full fine-tuning foundations and when to use it
LoRA mechanics, rank selection, layer targeting
QLoRA: 4-bit quantization + LoRA for consumer GPUs
Adapter architectures: serial, parallel, fusion strategies
Compute optimizations: merged vs dynamic adapters
Synthetic data generation for fine-tuning
17 Full Fine-Tuning
17.1 When to Use Full Fine-Tuning
Use Cases:
Domain shift: Medical, legal, code (vocabulary/distribution far from pre-training)
Small models: \(<\)3B parameters where memory is manageable
Maximum performance: Task requires full model capacity (e.g., complex reasoning)
Sufficient data: 10K+ high-quality examples (low risk of overfitting)
Avoid When:
Limited compute (1-2 GPUs, \(<\)40GB VRAM each)
Small dataset (\(<\)1K examples): high overfitting risk
Multi-task serving (can’t afford multiple full checkpoints)
17.2 Training Recipe
Hyperparameters:
Learning rate: \(5 \times 10^{-6}\) to \(5 \times 10^{-5}\) (10-100× lower than pre-training)
Batch size: 8-32 per GPU with gradient accumulation
Epochs: 1-3 (monitor validation perplexity closely)
Optimizer: AdamW with \(\beta_1=0.9\), \(\beta_2=0.95\), weight decay \(0.1\)
Warmup: 3-10% of steps
Scheduler: Cosine decay to \(10\%\) of peak LR
Layer-Specific Learning Rates:
Embeddings: \(1 \times 10^{-6}\) (preserve token representations)
Early layers: \(2 \times 10^{-6}\) (low-level features stable)
Middle layers: \(1 \times 10^{-5}\) (task-specific features)
Final layers: \(5 \times 10^{-5}\) (task head, most adaptation)
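The warmup-plus-cosine schedule in the recipe above can be sketched in a few lines (a minimal illustration; the function name and defaults are mine, not from any library):

```python
import math

def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_frac=0.05, floor_frac=0.1):
    """Linear warmup to peak_lr, then cosine decay to floor_frac * peak_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear warmup from ~0 up to the peak learning rate
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to floor_frac * peak_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = floor_frac * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))
```

With 5% warmup over 100 steps, the rate climbs linearly for 5 steps, peaks, and decays to 10% of the peak by the end, matching the recipe.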
17.3 Memory Requirements
For a model with \(P\) parameters in FP16/BF16:
Model weights: \(2P\) bytes
Gradients: \(2P\) bytes
Optimizer states (Adam): \(8P\) bytes (FP32 first/second moments)
Activations: Depends on batch size and sequence length (often \(>10P\) for LLMs)
Total: \(\sim 12P + \text{activations}\)
Example: Llama-2-7B (7B parameters \(\times\) 2 bytes = 14GB) requires: \[14\text{GB (weights)} + 14\text{GB (grads)} + 56\text{GB (Adam)} + 30\text{GB (activations)} \approx 114\text{GB}\]
Requires A100 80GB or multi-GPU with model parallelism.
Gradient Checkpointing: Trades compute for memory by recomputing activations during backward pass instead of storing them. Reduces activation memory by \(\sim\)3-5× but increases training time by \(\sim\)20-30%. Essential for large models on limited VRAM.
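The memory accounting above can be wrapped in a quick estimator (a rough sketch; the function name and defaults are mine, and the activation term is workload-dependent, so it is passed in as an assumption):

```python
def full_ft_memory_gb(params_b, bytes_per_weight=2, adam_bytes=8,
                      activation_gb=30.0):
    """Rough full fine-tuning memory estimate in GB for a model with
    params_b billion parameters: weights + gradients in 16-bit,
    Adam first/second moments in FP32 (8 bytes/param)."""
    weights = bytes_per_weight * params_b   # 1B params * 1 byte ~ 1 GB
    grads = bytes_per_weight * params_b
    optim = adam_bytes * params_b
    return weights + grads + optim + activation_gb
```

For Llama-2-7B this reproduces the 114GB figure above: 14 + 14 + 56 + 30.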
18 Low-Rank Adaptation (LoRA)
18.1 Core Idea
LoRA hypothesis: Fine-tuning updates have low intrinsic rank.
Instead of updating weight matrix \(W \in \mathbb{R}^{d \times k}\), inject low-rank decomposition: \[W' = W + \Delta W = W + BA\] where:
\(W\): Frozen pre-trained weights (\(d \times k\))
\(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\): Trainable low-rank matrices
\(r \ll \min(d, k)\): Rank (typically \(r = 4\)-\(64\))
Forward Pass: \[h = W x + \frac{\alpha}{r} BA x\] where \(\alpha\) is a scaling factor (typically \(\alpha = r\) so scaling is \(1\)).
Parameter Reduction: Original has \(dk\) parameters; LoRA adds \(r(d+k)\).
For \(d=k=4096\) and \(r=8\): \[\text{Original: } 4096^2 \approx 16.8\text{M params} \qquad \text{LoRA: } 8 \times 8192 \approx 65.5\text{K params} \qquad (256\times \text{ reduction})\]
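Checking the arithmetic in code:

```python
d = k = 4096
r = 8

full_params = d * k            # dense update: 16,777,216 (~16.8M)
lora_params = r * (d + k)      # B (d x r) plus A (r x k): 65,536 (~65.5K)
reduction = full_params // lora_params  # 256
```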
18.2 Initialization
Standard LoRA:
\(A \sim \mathcal{N}(0, \sigma^2)\) where \(\sigma = 1/\sqrt{r}\) (Kaiming-like)
\(B = 0\) (ensures \(\Delta W = 0\) at initialization, so the model starts identical to the base)
Scaling \(\Delta W = \frac{\alpha}{r} BA\) with \(\alpha \approx r\) keeps update magnitude stable
Why this works: Setting \(B=0\) preserves the pretrained model at step 0. Gradients first move \(B\) off zero, then \(A\) starts learning. Initializing both \(A\) and \(B\) to zero would block learning (zero gradients for both).
Alternative (Warm Start):
Initialize \(BA\) via SVD of small full fine-tuning update: \(\Delta W_{\text{warmup}} \approx U \Sigma V^T\), then \(B = U_{:r} \Sigma_{:r}^{1/2}\), \(A = \Sigma_{:r}^{1/2} V_{:r}^T\)
This SVD-informed initialization starts LoRA near a known good update direction, improving early convergence
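A toy LoRA layer in plain Python makes the zero-initialization property concrete (illustrative only; the class and helper names are mine, and real implementations use GPU tensors):

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

class LoRALinear:
    """Toy LoRA layer: y = Wx + (alpha/r) B(Ax).
    W is frozen; A ~ Gaussian with sigma = 1/sqrt(r); B = 0,
    so the layer matches the base model exactly at initialization."""
    def __init__(self, W, r, alpha=None, seed=0):
        rng = random.Random(seed)
        d, k = len(W), len(W[0])
        self.W, self.r = W, r
        self.alpha = alpha if alpha is not None else r  # alpha = r -> scaling 1
        sigma = 1.0 / r ** 0.5
        self.A = [[rng.gauss(0, sigma) for _ in range(k)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d)]  # zeros: Delta W = 0 at step 0

    def forward(self, x):
        base = matvec(self.W, x)
        bax = matvec(self.B, matvec(self.A, x))
        s = self.alpha / self.r
        return [b + s * v for b, v in zip(base, bax)]
```

At initialization the adapter contributes nothing, so `forward` returns exactly `Wx`; only once gradients move \(B\) off zero does the update take effect.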
18.3 Which Layers to Apply LoRA?
Typical Choices (Transformers):
Query/Value (Q, V) only: Default in original paper, works well for most tasks
All attention (Q, K, V, O): Better for complex tasks, 2× more params
Attention + FFN: Maximum capacity, 3-4× more params than Q/V only
Freeze embeddings: Almost always; embeddings are largely task-agnostic
Rule of Thumb:
Small dataset (\(<\)1K): Q+V only, low rank (\(r=4\)-\(8\))
Medium dataset (1K-10K): Q+K+V+O, medium rank (\(r=16\)-\(32\))
Large dataset (\(>\)10K): Attention + FFN, high rank (\(r=32\)-\(64\))
18.4 Rank Selection
Trade-offs:
Low rank (\(r=4\)-\(8\)): Minimal memory, fast, risk of underfitting
Medium rank (\(r=16\)-\(32\)): Good balance, most common
High rank (\(r=64\)-\(128\)): Approaches full fine-tuning, higher overfitting risk
Empirical Observations:
Task difficulty matters more than dataset size
Complex reasoning (math, code) benefits from \(r=32\)-\(64\)
Simple classification often saturates at \(r=8\)-\(16\)
Ablation study: Train \(r=8, 16, 32\) and pick best validation performance
Adaptive Rank: Some frameworks (e.g., AdaLoRA) dynamically adjust rank per layer during training by pruning low-importance singular values. Saves memory while maintaining performance.
18.5 Training Recipe
Hyperparameters:
Learning rate: \(1 \times 10^{-4}\) to \(5 \times 10^{-4}\) (10× higher than full fine-tuning)
Batch size: 16-64 (can use larger since memory footprint is small)
Epochs: 3-10 (LoRA trains faster than full fine-tuning)
Optimizer: AdamW (only store states for LoRA params, not base model)
Warmup: 5-10% of steps
Scheduler: Linear or cosine decay
Memory Savings:
Base model \(W\): FP16, frozen (no gradients or optimizer states)
LoRA matrices \(A, B\): FP32 or BF16 with optimizer states
Total memory: \(2P_{\text{base}} + 12P_{\text{LoRA}}\)
For Llama-2-7B with \(r=16\) on Q/V (adds \(\sim\)20M trainable params): \[14\text{GB (base)} + 0.24\text{GB (LoRA states)} + 20\text{GB (activations)} \approx 35\text{GB}\] Fits on single A100 40GB!
19 Quantized LoRA (QLoRA)
19.1 Core Idea
QLoRA = Quantize base model to 4-bit + LoRA adapters in high precision.
Motivation: The base model weights \(W\) are frozen, so they can be quantized aggressively. The adapter matrices \(B, A\) are trained in FP16/BF16 for stability.
Key Difference from LoRA:
LoRA: Base model in FP16/BF16 (14GB for 7B model)
QLoRA: Base model in NF4 (Normal Float 4-bit) + double quantization (3.5GB for 7B model)
19.2 NF4 Quantization
Normal Float 4-bit (NF4): Quantization levels chosen to match Gaussian distribution \(\mathcal{N}(0, 1)\).
Why? Pre-trained weights approximately \(\mathcal{N}(0, \sigma^2)\). Standard uniform INT4 wastes bins in tail; NF4 concentrates bins near zero.
Key Property: 16 quantization levels positioned such that each bin has equal probability under \(\mathcal{N}(0,1)\) (information-theoretically optimal for Gaussian data).
Double Quantization: Quantize the scale factors themselves to INT8 (saves \(\sim\)0.5GB for 7B model).
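The equal-probability idea can be illustrated with the standard library (a sketch only; the actual NF4 codebook is constructed somewhat differently and pins an exact zero level):

```python
from statistics import NormalDist

def gaussian_levels(bits=4):
    """Place 2**bits quantization levels at the midpoint quantiles of
    N(0, 1), so each bin carries equal probability mass, then normalize
    to [-1, 1]. Illustrates the NF4 design principle."""
    n = 2 ** bits
    nd = NormalDist()
    levels = [nd.inv_cdf((2 * i + 1) / (2 * n)) for i in range(n)]
    m = max(abs(l) for l in levels)
    return [l / m for l in levels]
```

Note how the 16 levels cluster near zero and spread out in the tails, matching where Gaussian weight mass actually lies, unlike uniform INT4.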
19.3 QLoRA vs LoRA
| Method | Base Model | Memory (7B) | Performance |
|---|---|---|---|
| Full Fine-Tuning | FP16 | 114GB | Baseline |
| LoRA | FP16 | 35GB | \(-0.5\%\) to \(-1\%\) |
| QLoRA | NF4 | 12GB | \(-1\%\) to \(-2\%\) |
When to Use QLoRA:
Consumer GPUs (RTX 4090 24GB, RTX 3090 24GB)
Training 13B-33B models on single GPU
Extreme memory constraint (e.g., fine-tuning on laptop with 16GB VRAM)
Trade-offs:
Pros: 3-4× memory reduction vs LoRA, enables larger models
Cons: Slower training (\(\sim\)30% due to dequantization overhead), slight quality drop (\(\sim\)1-2%)
19.4 Training Recipe
Same as LoRA except:
Slightly higher rank (\(r=32\)-\(64\)) to compensate for base model quantization
May need lower learning rate (\(5 \times 10^{-5}\)) for stability
More epochs (5-15) to converge due to quantization noise
Paged Optimizers: QLoRA uses paged optimizers, which offload optimizer states to CPU RAM when GPU memory runs short. Essential for fitting 33B+ models on a 24GB GPU.
20 Adapter Architectures
Beyond LoRA’s parallel low-rank injection, several adapter patterns exist.
20.1 Serial Adapters (Houlsby-style)
Architecture: Insert adapter modules after each transformer sub-layer.
\(h_1 = \text{Attention}(x) + x\) \(h_2 = \text{Adapter}(h_1) + h_1\) \(h_3 = \text{FFN}(h_2) + h_2\) \(h_4 = \text{Adapter}(h_3) + h_3\)
Adapter Module: \[\text{Adapter}(h) = W_{\text{up}} \cdot \sigma(W_{\text{down}} \cdot h)\] where \(W_{\text{down}} \in \mathbb{R}^{d \times r}\), \(W_{\text{up}} \in \mathbb{R}^{r \times d}\), \(r \ll d\) (bottleneck).
Adapters vs LoRA: Both are Low-Rank Bottlenecks
Mapping:
| Component | Adapter | LoRA |
|---|---|---|
| Down-projection (\(d \to r\)) | \(W_{\text{down}}\) | \(A\) |
| Up-projection (\(r \to d\)) | \(W_{\text{up}}\) | \(B\) |
| Non-linearity | Yes (\(\sigma\)) | No |
| Placement | Serial (after layer) | Parallel (with layer) |
| Merge at inference? | No (sequential) | Yes (\(W' = W + BA\)) |
Both use \(2 \times d \times r\) trainable parameters per module. LoRA’s lack of non-linearity is offset by ability to merge weights, eliminating inference overhead.
Pros/Cons:
Pros: More expressive (non-linear via \(\sigma\)), good for multi-task learning
Cons: Adds latency (sequential), 2× more adapter modules than LoRA
20.2 Parallel Adapters (LoRA-style)
Architecture: Add adapter output in parallel with original layer.
\(h = W x + \text{Adapter}(x) + x\)
For LoRA: \(\text{Adapter}(x) = \frac{\alpha}{r} BA x\)
Pros/Cons:
Pros: No latency increase (fused computation), simpler training
Cons: Linear only (no non-linearity), slightly less expressive
20.3 Adapter Fusion
Problem: After training task-specific adapters, how to combine them for multi-task inference?
Naive: Switch adapters per task (requires model reload).
Fusion: Learn weighted combination of adapters with small fusion layer: \[h = Wx + \sum_{i=1}^{N} \alpha_i \cdot \text{Adapter}_i(x)\] where \(\alpha_i\) learned via attention over task embeddings.
Use Case: Multi-task serving where model handles multiple domains (e.g., chatbot with medical/legal/general knowledge).
21 Compute Optimizations
21.1 Merged vs Dynamic Adapters
Merged Adapters (Deployment):
After training, compute \(W' = W + BA\) and save single checkpoint
Pros: No inference overhead, same speed as base model
Cons: Cannot switch tasks dynamically, requires separate checkpoint per task
Implementation:
# Merge LoRA into base model
W_merged = W_base + (lora_B @ lora_A) * (alpha / rank)
Dynamic Adapters (Multi-Task Serving):
Keep base model \(W\) frozen, compute \(Wx + BAx\) at runtime
Pros: Switch adapters per request (load \(B, A\) from disk), single base model for all tasks
Cons: \(\sim\)10-20% latency overhead for separate matrix multiplies
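On tiny matrices the two paths are easy to verify as numerically equivalent (the values below are arbitrary illustrations):

```python
def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

# Tiny illustrative shapes: d = k = 3, r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]   # d x r
A = [[0.5, 1.0, 0.0]]       # r x k
alpha, r = 2.0, 1
s = alpha / r
x = [1.0, 2.0, 3.0]

# Merged path: fold the adapter into the weights once (deployment).
BA = matmul(B, A)
W_merged = [[W[i][j] + s * BA[i][j] for j in range(3)] for i in range(3)]
y_merged = matvec(W_merged, x)

# Dynamic path: keep W frozen, apply the adapter at runtime (multi-task).
y_dynamic = [base + s * extra for base, extra in
             zip(matvec(W, x), matvec(B, matvec(A, x)))]

assert all(abs(a - b) < 1e-9 for a, b in zip(y_merged, y_dynamic))
```

The outputs are identical; the trade-off is purely operational (speed vs the ability to swap adapters per request).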
21.2 Batched Multi-Adapter Inference
Problem: Batch contains requests for different tasks (different adapters).
Naive: Run each request separately (no batching benefit).
Optimized (S-LoRA): Compute base model forward pass once, then apply task-specific adapters:
\(H_{\text{base}} = W X\) \(H_t = H_{\text{base}} + B_t A_t X_t\)
Memory Management: Keep hot adapters in GPU memory, swap cold adapters to CPU/disk.
Production Example: vLLM + LoRAX supports batched multi-adapter inference with \(<\)5% throughput degradation vs single-task.
21.3 Fused Kernels for LoRA
Standard Approach:
Compute \(y_1 = Wx\) (base GEMM)
Compute \(y_2 = Ax\) (adapter GEMM)
Compute \(y_3 = By_2\) (adapter GEMM)
Add: \(y = y_1 + \frac{\alpha}{r} y_3\)
Total: 3 kernel launches, poor memory locality.
Fused Kernel:
Single kernel computes \(y = Wx + \frac{\alpha}{r} B(Ax)\)
Speedup: 20-40% faster than naive implementation
Available in: PEFT library, vLLM, TensorRT-LLM
21.4 Training Speedups
Gradient Checkpointing: Even with LoRA, activations dominate memory. Checkpointing reduces memory by 3-5×.
Mixed Precision:
Base model: FP16/BF16 (or NF4 for QLoRA)
Adapters: FP32 (higher precision for stability)
Gradients: FP16 (reduce memory)
DeepSpeed ZeRO: Shard optimizer states across GPUs (Stage 2) or all parameters (Stage 3). Enables LoRA training of 70B+ models on 8× A100.
22 Synthetic Data for Fine-Tuning
22.1 When to Use Synthetic Data
Scenarios:
Limited labeled data (\(<\)100 examples)
Domain-specific task with no public dataset (e.g., company-internal QA)
Data augmentation for low-resource tasks
Bootstrapping for instruction tuning
22.2 Generation Methods
22.2.1 Teacher Model Sampling
Same as distillation (nucleus sampling, \(p=0.9\)):
Curate seed prompts (100-1K examples covering task distribution)
Generate completions from larger teacher model
Filter by quality: perplexity \(<\) threshold, length in range, no toxicity
Fine-tune student on synthetic (prompt, completion) pairs
Production Example: Alpaca (52K instruction-following examples generated from GPT-3.5-Turbo + 175 seed prompts).
22.2.2 Self-Instruct
Idea: Use model to generate its own training data iteratively.
Initialize with small seed set (e.g., 50 examples)
Sample \(n\) seed examples from current dataset
Prompt model to generate new instructions
Generate outputs for new instructions
Filter for quality and diversity
Add to dataset and fine-tune
Challenges:
Quality degrades over iterations (model amplifies its own mistakes)
Requires strong base model (GPT-3.5+ level)
Need diversity filters (avoid redundant examples)
22.2.3 Evol-Instruct
Idea: Iteratively increase complexity of instructions.
Complexity Operations:
Add constraints (e.g., "in 100 words", "without using the letter ‘e’")
Increase reasoning steps (multi-hop questions)
Add domain knowledge requirements
Combine multiple skills (summarize + translate)
Example:
Seed: "Summarize this article."
Evolved: "Summarize this medical research article in layman’s terms, focusing on clinical implications, in under 150 words."
Used in WizardLM, WizardCoder.
22.3 Quality Control
Filters:
Perplexity: Reject examples with \(\text{PPL} > 100\) (likely gibberish)
Length: Filter too short (\(<\)20 tokens) or too long (\(>\)2K tokens)
Diversity: Use embedding clustering, discard near-duplicates
Toxicity: Run Perspective API or toxicity classifier
Task adherence: Prompt-based validation (does output follow instruction?)
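The automatic filters above might be combined into a single gate per example (a sketch; the field names and thresholds are illustrative, and they assume scores were precomputed by upstream tools):

```python
def keep_example(ex, max_ppl=100.0, min_tokens=20, max_tokens=2000):
    """Return True if a synthetic example passes the quality gates.
    Expects a dict with precomputed 'ppl' and 'tokens' fields and an
    optional 'toxicity' score in [0, 1]."""
    if ex["ppl"] > max_ppl:            # likely gibberish
        return False
    if not (min_tokens <= ex["tokens"] <= max_tokens):  # too short/long
        return False
    if ex.get("toxicity", 0.0) > 0.5:  # toxicity classifier output
        return False
    return True
```

Diversity filtering (embedding clustering, near-duplicate removal) operates on the dataset as a whole, so it would run as a separate pass after this per-example gate.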
Human-in-the-Loop:
Sample 5-10% of synthetic data for manual review
Identify systematic errors (e.g., model always refuses certain instructions)
Refine generation prompts based on failure modes
23 End-to-End: Fine-Tuning Qwen-Coder for Custom Repository
23.1 Use Case & Goals
Scenario: You have a proprietary codebase (e.g., internal Python framework with custom APIs, naming conventions, architectural patterns) and want to adapt Qwen2.5-Coder-7B to generate code following your conventions.
Goals:
Generate code using custom APIs (not seen during pretraining)
Follow internal naming conventions and style guides
Handle repository-specific patterns (e.g., config managers, logging utilities)
Maintain general coding ability without catastrophic forgetting
23.2 Step 1: Tokenization Analysis
Check Vocabulary Coverage:
Qwen-Coder uses a large vocabulary (\(\sim\)152k tokens) trained on code corpora. However, your custom APIs may not be well-represented.
Extract custom identifiers: Collect function names, class names, variables from your codebase:
# identifiers.py
from transformers import AutoTokenizer
import re

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")

# Extract identifiers from code
with open("your_repo/core/api.py") as f:
    code = f.read()
identifiers = re.findall(r'\b[A-Za-z_][A-Za-z0-9_]*\b', code)

# Check tokenization
for name in set(identifiers):
    tokens = tokenizer.tokenize(name)
    if len(tokens) > 1:
        print(f"{name} -> {tokens}  # Fragmented!")

Decision: If \(>\)20% of identifiers fragment into 3+ tokens, consider adding custom tokens to the vocabulary. Otherwise, proceed without modification (Qwen-Coder handles long identifiers reasonably).
Vocabulary extension (optional):
# Add custom tokens (use sparingly!)
custom_tokens = ["CustomAPIClient", "InternalConfig", ...]
tokenizer.add_tokens(custom_tokens)
model.resize_token_embeddings(len(tokenizer))

Warning: New token embeddings are randomly initialized; they require more training data to learn.
23.3 Step 2: Data Preparation
Dataset Construction:
Extract repository snippets:
Parse Python files, extract functions/classes with docstrings
Create pairs: (docstring \(\rightarrow\) implementation)
Filter: remove trivial functions (\(<\)5 lines), keep high-quality comments
Synthetic pair generation: Use GPT-4 or Claude to generate instruction-code pairs:
# Prompt template
You are documenting a Python codebase. Given this function:
```python
{function_code}
```
Generate:
1. A natural language instruction requesting this function
2. A docstring explaining its purpose
3. Example usage

Format for instruction tuning:
# dataset.jsonl (ChatML format for Qwen)
{"messages": [
  {"role": "system", "content": "You are a code assistant..."},
  {"role": "user", "content": "Write a function to load config..."},
  {"role": "assistant", "content": "```python\n{code}\n```"}
]}

Mix with general code data: Include 20-30% open-source examples (HumanEval, MBPP) to prevent forgetting.
Target Dataset Size:
Minimum: 500 high-quality pairs
Recommended: 2,000-5,000 pairs (mix of real + synthetic)
Maximum: 10,000+ if available (diminishing returns)
23.4 Step 3: LoRA Configuration
Why LoRA for Code Models:
Qwen2.5-Coder-7B has 7B params → full fine-tuning needs 114GB GPU memory
LoRA reduces to 12-24GB (fits on single A10/A100 40GB)
Preserves base model’s general coding ability
Enables multi-repository adapters (train separate LoRAs per codebase)
Recommended Hyperparameters:
| Parameter | Value | Rationale |
|---|---|---|
| Rank \(r\) | 32 | Code generation needs higher capacity than classification |
| Alpha \(\alpha\) | 64 | \(\alpha = 2r\) for stable scaling |
| Target modules | Q+K+V+O+FFN | Code requires reasoning (attention + feedforward) |
| Dropout | 0.05 | Light regularization |
| Learning rate | \(3 \times 10^{-4}\) | Standard for LoRA |
| Batch size | 4-8 | Per-device, use gradient accumulation |
| Gradient accum steps | 4 | Effective batch = 16-32 |
| Epochs | 3-5 | Monitor validation, stop early |
| Max seq length | 2048 | Code context window |
23.5 Step 4: Training with PEFT + DeepSpeed
Setup (HuggingFace PEFT):
# train.py
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
import torch
# Load model in 4-bit for QLoRA (optional)
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-Coder-7B",
load_in_4bit=True, # Use QLoRA
torch_dtype=torch.bfloat16,
device_map="auto"
)
model = prepare_model_for_kbit_training(model)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")
tokenizer.pad_token = tokenizer.eos_token
# LoRA config
lora_config = LoraConfig(
r=32,
lora_alpha=64,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints trainable vs total parameter counts
# Training arguments
training_args = TrainingArguments(
output_dir="./qwen-coder-custom-lora",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=3e-4,
fp16=False,
bf16=True,
logging_steps=10,
save_strategy="epoch",
evaluation_strategy="epoch",
warmup_steps=50,
lr_scheduler_type="cosine",
max_grad_norm=1.0
)
# SFTTrainer for instruction tuning
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
max_seq_length=2048,
packing=False # Keep False for code (maintain context boundaries)
)
trainer.train()
trainer.save_model()
Expected Training Time:
Hardware: Single A100 40GB
Dataset: 5,000 examples, max length 2048
Time: \(\sim\)6-8 hours for 3 epochs
Memory: \(\sim\)24GB (QLoRA) or 40GB (16-bit LoRA)
23.6 Step 5: Evaluation
Metrics:
Pass@k on custom test set:
Create 50-100 held-out instructions from your codebase
Generate \(k=10\) completions per instruction
Execute and verify correctness (unit tests)
Measure: fraction that pass tests
HumanEval retention: Run HumanEval benchmark to ensure no catastrophic forgetting of general coding.
Style adherence: Manual review; does the generated code follow your conventions?
Naming (snake_case vs camelCase)
Import patterns (from mylib import X vs import mylib.X)
Error handling (custom exceptions)
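The pass@k metric from Step 5 has a standard unbiased combinatorial estimator, given \(n\) generated samples per instruction of which \(c\) pass the tests:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn (without replacement) from n generations is correct,
    given that c of the n are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over the 50-100 held-out instructions gives the final score; generating \(n > k\) samples per instruction reduces estimator variance.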
Inference Example:
# inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-Coder-7B",
torch_dtype=torch.bfloat16,
device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./qwen-coder-custom-lora")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")
# Generate
messages = [
{"role": "system", "content": "You are an expert in our codebase."},
{"role": "user", "content": "Write a function to load config from YAML"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False,
add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
23.7 Step 6: Deployment Strategies
Option 1: Merge LoRA into base model (single-task):
from peft import PeftModel
model = PeftModel.from_pretrained(base_model, "./qwen-coder-custom-lora")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./qwen-coder-merged")
Pros: Faster inference (no adapter overhead)
Cons: Cannot switch between repositories
Option 2: Multi-adapter serving (multi-task):
Keep base model in memory
Load LoRA adapters dynamically per request (LoRAX, vLLM)
Each repository gets its own adapter
Latency: \(<\)50ms to swap adapters
Option 3: Quantized deployment (edge):
# Quantize merged model to INT8/INT4
from optimum.quanto import quantize, qint4
quantize(merged_model, weights=qint4, activations=None)
merged_model.save_pretrained("./qwen-coder-int4")
23.8 Production Considerations
Data pipeline: Automate scraping new code weekly, retrain LoRA monthly
Version control: Tag adapters with repo commits (lora-v1.2.3)
A/B testing: Serve base model vs LoRA to measure quality improvement
Monitoring: Log generated code → manual review → feedback loop
Compliance: Ensure training data doesn’t leak proprietary secrets (filter credentials, keys)
Multi-repo scaling: Train separate adapters per microservice/team
24 Best Practices & Common Pitfalls
24.1 Best Practices
General:
Start small: Try LoRA with \(r=8\) on Q+V before scaling up
Monitor validation: Stop early if validation loss plateaus or increases
Save checkpoints: Save every epoch; the best checkpoint is often not the last
Ablate hyperparameters: Test \(r \in \{8, 16, 32\}\) and LR \(\in \{10^{-4}, 3 \times 10^{-4}, 10^{-3}\}\)
Dataset-Specific:
Small data (\(<\)1K): Low rank (\(r=4\)-\(8\)), more regularization (dropout \(0.1\)), more epochs (10-20)
Medium data (1K-10K): Standard recipe (\(r=16\), LR \(3 \times 10^{-4}\), 5-10 epochs)
Large data (\(>\)10K): Higher rank (\(r=32\)-\(64\)), consider full fine-tuning if compute allows
Task-Specific:
Classification/NER: Q+V sufficient, low rank
Generation (summarization, translation): Q+K+V+O, medium rank
Complex reasoning (math, code): Attention + FFN, high rank (\(r=32\)-\(64\))
24.2 Common Pitfalls
Rank too high: Overfits on small data, wasted memory
Learning rate too high: Catastrophic forgetting, loss spikes
Training too long: Overfitting, validation loss increases after epoch 3-5
Ignoring data quality: Synthetic data with errors propagates to model
Not freezing embeddings: Wastes memory, destabilizes rare tokens
Merging adapters prematurely: Test dynamic inference first (may need multi-task)
24.3 Production Recipes
Llama-2-7B Fine-Tuning (LoRA):
Layers: Q+V, \(r=16\), \(\alpha=16\)
LR: \(3 \times 10^{-4}\), warmup 3%, cosine decay
Batch size: 32 (gradient accumulation 4), epochs: 5
Memory: 35GB (A100 40GB)
Llama-2-13B Fine-Tuning (QLoRA):
Base: NF4 + double quantization, Adapters: Q+K+V+O, \(r=32\)
LR: \(1 \times 10^{-4}\), warmup 5%, linear decay
Batch size: 16, epochs: 10
Memory: 18GB (RTX 4090 24GB)
Multi-Task Adapter Serving:
Base model in GPU memory (14GB for 7B)
Hot adapters in GPU (top 10, \(\sim\)100MB each)
Cold adapters on CPU/disk (swap on demand, \(<\)50ms latency)
Use vLLM + LoRAX for batched inference
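The hot/cold adapter policy above sketches naturally as an LRU cache (illustrative only; `load_fn` stands in for reading adapter weights from CPU RAM or disk):

```python
from collections import OrderedDict

class AdapterCache:
    """LRU cache for hot adapters: the top-N adapters stay resident,
    the coldest is evicted when a new one is requested."""
    def __init__(self, load_fn, capacity=10):
        self.load_fn = load_fn
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, adapter_id):
        if adapter_id in self._cache:
            self._cache.move_to_end(adapter_id)   # mark as hot
        else:
            if len(self._cache) >= self.capacity:
                self._cache.popitem(last=False)   # evict the coldest
            self._cache[adapter_id] = self.load_fn(adapter_id)  # cold load
        return self._cache[adapter_id]
```

In production the cold-load path is what incurs the \(<\)50ms swap latency; cache hits cost nothing beyond the adapter matmuls.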
25 Interview Questions
Q: Why does LoRA work?
A: Fine-tuning updates have low intrinsic rank; most of the information sits in the top few singular values. Full-rank updates are redundant.
Q: LoRA vs QLoRA?
A: LoRA uses FP16 base (35GB for 7B), QLoRA uses NF4 base (12GB). QLoRA adds \(\sim\)1-2% quality loss but enables larger models on consumer GPUs.
Q: When to use full fine-tuning vs LoRA?
A: Full FT for domain shift + large data + sufficient compute. LoRA for limited compute, small data, multi-task serving.
Q: How to choose rank \(r\)?
A: Start with \(r=16\). Ablate \(\{8, 16, 32\}\) based on validation. Complex tasks (code, math) benefit from \(r=32\)-\(64\). Small data risks overfitting with high \(r\).
Q: Which layers for LoRA?
A: Q+V default, Q+K+V+O for better quality, attention+FFN for max capacity. Always freeze embeddings.
Q: Merged vs dynamic adapters?
A: Merge for single-task deployment (no overhead). Keep dynamic for multi-task serving (10-20% latency cost but flexible).
Q: How to generate synthetic data?
A: Nucleus sampling (\(p=0.9\)) from teacher model. Filter by perplexity, length, toxicity. Human review 5-10%. Examples: Alpaca (52K from GPT-3.5), WizardLM (Evol-Instruct).
26 Summary
Fine-Tuning Hierarchy:
Full Fine-Tuning: Update all params, highest quality, 114GB for 7B model
LoRA: Low-rank adapters, 35GB for 7B, 256× param reduction, \(<\)1% quality loss
QLoRA: 4-bit base + LoRA, 12GB for 7B, 1-2% quality loss, consumer GPU friendly
Key Hyperparameters:
Rank: \(r=16\) default, ablate \(\{8, 16, 32\}\)
Layers: Q+V (default), Q+K+V+O (better), attention+FFN (max)
LR: \(3 \times 10^{-4}\) for LoRA, \(5 \times 10^{-5}\) for full FT
Epochs: 5-10 for LoRA, 1-3 for full FT
Deployment Strategies:
Single-task: Merge adapters into base model (no overhead)
Multi-task: Dynamic adapters with batched inference (S-LoRA, vLLM)
Extreme memory: QLoRA + gradient checkpointing + DeepSpeed ZeRO
Synthetic Data:
Teacher sampling with nucleus (\(p=0.9\)) for diversity
Filter by perplexity, length, toxicity, diversity
Self-Instruct / Evol-Instruct for bootstrapping
Mix 70-90% synthetic with 10-30% real data