Chapter 14: LLM Fine-Tuning

16 Introduction

Fine-tuning adapts pre-trained LLMs to specific tasks or domains. Key challenges:

  • Memory: Full fine-tuning requires storing optimizer states (2-4× model size for Adam)

  • Compute: Backward pass through all layers is expensive

  • Catastrophic Forgetting: Aggressive updates degrade general knowledge

  • Multi-Task Serving: Switching between task-specific checkpoints is slow

Solution Hierarchy:

  1. Full Fine-Tuning: Update all parameters (baseline, most expensive)

  2. Parameter-Efficient Fine-Tuning (PEFT): Update small subset or low-rank adapters

  3. LoRA: Low-rank adaptation–inject trainable rank decomposition matrices

  4. QLoRA: LoRA + quantized base model (4-bit) for extreme memory efficiency

This Chapter Covers:

  • Full fine-tuning foundations and when to use it

  • LoRA mechanics, rank selection, layer targeting

  • QLoRA: 4-bit quantization + LoRA for consumer GPUs

  • Adapter architectures: serial, parallel, fusion strategies

  • Compute optimizations: merged vs dynamic adapters

  • Synthetic data generation for fine-tuning

17 Full Fine-Tuning

17.1 When to Use Full Fine-Tuning

Use Cases:

  • Domain shift: Medical, legal, code (vocabulary/distribution far from pre-training)

  • Small models: \(<\)3B parameters where memory is manageable

  • Maximum performance: Task requires full model capacity (e.g., complex reasoning)

  • Sufficient data: 10K+ high-quality examples (low risk of overfitting)

Avoid When:

  • Limited compute (1-2 GPUs, \(<\)40GB VRAM each)

  • Small dataset (\(<\)1K examples)–high overfitting risk

  • Multi-task serving (can’t afford multiple full checkpoints)

17.2 Training Recipe

Hyperparameters:

  • Learning rate: \(5 \times 10^{-6}\) to \(5 \times 10^{-5}\) (10-100× lower than pre-training)

  • Batch size: 8-32 per GPU with gradient accumulation

  • Epochs: 1-3 (monitor validation perplexity closely)

  • Optimizer: AdamW with \(\beta_1=0.9\), \(\beta_2=0.95\), weight decay \(0.1\)

  • Warmup: 3-10% of steps

  • Scheduler: Cosine decay to \(10\%\) of peak LR

Layer-Specific Learning Rates:

  • Embeddings: \(1 \times 10^{-6}\) (preserve token representations)

  • Early layers: \(2 \times 10^{-6}\) (low-level features stable)

  • Middle layers: \(1 \times 10^{-5}\) (task-specific features)

  • Final layers: \(5 \times 10^{-5}\) (task head, most adaptation)

17.3 Memory Requirements

For a model with \(P\) parameters in FP16/BF16:

  • Model weights: \(2P\) bytes

  • Gradients: \(2P\) bytes

  • Optimizer states (Adam): \(8P\) bytes (FP32 first/second moments)

  • Activations: Depends on batch size and sequence length (often \(>10P\) for LLMs)

  • Total: \(\sim 12P + \text{activations}\)

Example: Llama-2-7B (7B parameters \(\times\) 2 bytes = 14GB) requires: \[14\text{GB (weights)} + 14\text{GB (grads)} + 56\text{GB (Adam)} + 30\text{GB (activations)} \approx 114\text{GB}\]

Requires A100 80GB or multi-GPU with model parallelism.
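The arithmetic above can be packaged as a quick estimator (a rough sketch: the 12-bytes-per-parameter figure follows the breakdown above and ignores framework overhead; the function name is illustrative):

```python
def full_ft_memory_gb(params_billion, activations_gb=0.0):
    """Rough memory estimate for full fine-tuning in FP16/BF16 with Adam.

    Per parameter: 2 bytes weights + 2 bytes gradients + 8 bytes
    FP32 Adam moments = 12 bytes, plus whatever activations need.
    """
    return 12 * params_billion + activations_gb

# Llama-2-7B with ~30GB of activations, as in the example above:
print(full_ft_memory_gb(7, activations_gb=30))  # -> 114.0 (GB)
```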

Note

Gradient Checkpointing: Trades compute for memory by recomputing activations during backward pass instead of storing them. Reduces activation memory by \(\sim\)3-5× but increases training time by \(\sim\)20-30%. Essential for large models on limited VRAM.

18 Low-Rank Adaptation (LoRA)

18.1 Core Idea

LoRA hypothesis: Fine-tuning updates have low intrinsic rank.

Instead of updating weight matrix \(W \in \mathbb{R}^{d \times k}\), inject low-rank decomposition: \[W' = W + \Delta W = W + BA\] where:

  • \(W\): Frozen pre-trained weights (\(d \times k\))

  • \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\): Trainable low-rank matrices

  • \(r \ll \min(d, k)\): Rank (typically \(r = 4\)-\(64\))

Forward Pass: \[h = W x + \frac{\alpha}{r} BA x\] where \(\alpha\) is a scaling factor (typically \(\alpha = r\) so scaling is \(1\)).

Parameter Reduction: Original has \(dk\) parameters; LoRA adds \(r(d+k)\).

For \(d=k=4096\) and \(r=8\): \[\text{Original: } 4096^2 \approx 16.8\text{M params} \qquad \text{LoRA: } 8 \times (4096 + 4096) \approx 65.5\text{K params} \qquad (256\times \text{ reduction})\]
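The reduction is easy to verify directly (function name is illustrative):

```python
def lora_params(d, k, r):
    """Trainable parameters added by a rank-r LoRA on a d x k weight: r(d + k)."""
    return r * (d + k)

d = k = 4096
full = d * k              # parameters in the original weight matrix
lora = lora_params(d, k, 8)
print(full, lora, full // lora)  # -> 16777216 65536 256
```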

18.2 Initialization

Standard LoRA:

  • \(A \sim \mathcal{N}(0, \sigma^2)\) where \(\sigma = 1/\sqrt{r}\) (Kaiming-like)

  • \(B = 0\) (ensures \(\Delta W = 0\) at initialization–model starts identical to base)

  • Scaling \(\Delta W = \frac{\alpha}{r} BA\) with \(\alpha \approx r\) keeps update magnitude stable

Why this works: Setting \(B=0\) preserves the pretrained model at step 0. Gradients first move \(B\) off zero (the random \(A\) gives it a useful direction); once \(B\) is nonzero, \(A\) begins to learn as well. Initializing both matrices to zero would block learning entirely, since every adapter gradient vanishes.
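A minimal sketch of the forward pass and initialization in plain Python (toy dimensions, no framework; `lora_forward` and `matvec` are ad-hoc helpers) confirms that with \(B=0\) the adapted model reproduces the base model exactly:

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x) -- the LoRA forward pass."""
    Ax = matvec(A, x)      # project down to the r-dim bottleneck
    BAx = matvec(B, Ax)    # project back up to d dims
    base = matvec(W, x)
    return [b + (alpha / r) * u for b, u in zip(base, BAx)]

d = k = 4; r = 2; alpha = 2
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]       # frozen
A = [[random.gauss(0, r ** -0.5) for _ in range(k)] for _ in range(r)]  # A ~ N(0, 1/r)
B = [[0.0] * r for _ in range(d)]                                    # B = 0
x = [1.0, -2.0, 0.5, 3.0]

# With B = 0, the LoRA model's output equals the frozen base model's output.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)
```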

Alternative (Warm Start):

  • Initialize \(BA\) via SVD of small full fine-tuning update: \(\Delta W_{\text{warmup}} \approx U \Sigma V^T\), then \(B = U_{:r} \Sigma_{:r}^{1/2}\), \(A = \Sigma_{:r}^{1/2} V_{:r}^T\)

  • This SVD-informed initialization starts LoRA near a known good update direction, improving early convergence

18.3 Which Layers to Apply LoRA?

Typical Choices (Transformers):

  • Query/Value (Q, V) only: Default in original paper, works well for most tasks

  • All attention (Q, K, V, O): Better for complex tasks, 2× more params

  • Attention + FFN: Maximum capacity, 3-4× more params than Q/V only

  • Freeze embeddings: Almost always–embeddings are task-agnostic

Rule of Thumb:

  • Small dataset (\(<\)1K): Q+V only, low rank (\(r=4\)-\(8\))

  • Medium dataset (1K-10K): Q+K+V+O, medium rank (\(r=16\)-\(32\))

  • Large dataset (\(>\)10K): Attention + FFN, high rank (\(r=32\)-\(64\))

18.4 Rank Selection

Trade-offs:

  • Low rank (\(r=4\)-\(8\)): Minimal memory, fast, risk of underfitting

  • Medium rank (\(r=16\)-\(32\)): Good balance, most common

  • High rank (\(r=64\)-\(128\)): Approaches full fine-tuning, higher overfitting risk

Empirical Observations:

  • Task difficulty matters more than dataset size

  • Complex reasoning (math, code) benefits from \(r=32\)-\(64\)

  • Simple classification often saturates at \(r=8\)-\(16\)

  • Ablation study: Train \(r=8, 16, 32\) and pick best validation performance

Note

Adaptive Rank: Some frameworks (e.g., AdaLoRA) dynamically adjust rank per layer during training by pruning low-importance singular values. Saves memory while maintaining performance.

18.5 Training Recipe

Hyperparameters:

  • Learning rate: \(1 \times 10^{-4}\) to \(5 \times 10^{-4}\) (10× higher than full fine-tuning)

  • Batch size: 16-64 (can use larger since memory footprint is small)

  • Epochs: 3-10 (LoRA trains faster than full fine-tuning)

  • Optimizer: AdamW (only store states for LoRA params, not base model)

  • Warmup: 5-10% of steps

  • Scheduler: Linear or cosine decay

Memory Savings:

  • Base model \(W\): FP16, frozen (no gradients or optimizer states)

  • LoRA matrices \(A, B\): FP32 or BF16 with optimizer states

  • Total memory: \(2P_{\text{base}} + 12P_{\text{LoRA}}\)

For Llama-2-7B with \(r=16\) on Q/V (adds \(\sim\)20M trainable params): \[14\text{GB (base)} + 0.24\text{GB (LoRA states)} + 20\text{GB (activations)} \approx 35\text{GB}\] Fits on single A100 40GB!

19 Quantized LoRA (QLoRA)

19.1 Core Idea

QLoRA = Quantize base model to 4-bit + LoRA adapters in high precision.

Motivation: Base model weights \(W\) frozen, so can aggressively quantize. Adapter matrices \(B, A\) trained in FP16/BF16 for stability.

Key Difference from LoRA:

  • LoRA: Base model in FP16/BF16 (14GB for 7B model)

  • QLoRA: Base model in NF4 (Normal Float 4-bit) + double quantization (3.5GB for 7B model)

19.2 NF4 Quantization

Normal Float 4-bit (NF4): Quantization levels chosen to match Gaussian distribution \(\mathcal{N}(0, 1)\).

Why? Pre-trained weights approximately \(\mathcal{N}(0, \sigma^2)\). Standard uniform INT4 wastes bins in tail; NF4 concentrates bins near zero.

Key Property: 16 quantization levels positioned such that each bin has equal probability under \(\mathcal{N}(0,1)\) (information-theoretically optimal for Gaussian data).

Double Quantization: Quantize the scale factors themselves to INT8 (saves \(\sim\)0.5GB for 7B model).
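An idealized version of the equal-probability codebook can be built from stdlib Gaussian quantiles. This is a sketch of the information-theoretic idea only: the real NF4 codebook additionally reserves an exact zero level, uses asymmetric bins, and normalizes weights per block, all of which this omits:

```python
from statistics import NormalDist

def equal_prob_levels(n_bits=4):
    """One quantization level per equal-probability bin of N(0, 1)."""
    n = 2 ** n_bits
    nd = NormalDist()
    # Place each level at the midpoint quantile of its bin.
    return [nd.inv_cdf((i + 0.5) / n) for i in range(n)]

def quantize(w, levels):
    """Map a (pre-normalized) weight to the nearest codebook level."""
    return min(levels, key=lambda lv: abs(lv - w))

levels = equal_prob_levels()
# Levels are symmetric about 0 and denser near 0 than in the tails,
# matching the approximately Gaussian distribution of pretrained weights.
```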

19.3 QLoRA vs LoRA

  Method             Base Model   Memory (7B)   Performance
  Full Fine-Tuning   FP16         114GB         Baseline
  LoRA               FP16         35GB          \(-0.5\%\) to \(-1\%\)
  QLoRA              NF4          12GB          \(-1\%\) to \(-2\%\)

When to Use QLoRA:

  • Consumer GPUs (RTX 4090 24GB, RTX 3090 24GB)

  • Training 13B-33B models on single GPU

  • Extreme memory constraint (e.g., fine-tuning on laptop with 16GB VRAM)

Trade-offs:

  • Pros: 3-4× memory reduction vs LoRA, enables larger models

  • Cons: Slower training (\(\sim\)30% due to dequantization overhead), slight quality drop (\(\sim\)1-2%)

19.4 Training Recipe

Same as LoRA except:

  • Slightly higher rank (\(r=32\)-\(64\)) to compensate for base model quantization

  • May need lower learning rate (\(5 \times 10^{-5}\)) for stability

  • More epochs (5-15) to converge due to quantization noise

Note

Paged Optimizers: QLoRA uses paged optimizers, which page optimizer states out to CPU RAM when GPU memory runs low. Essential for fitting 33B+ models on a 24GB GPU.

20 Adapter Architectures

Beyond LoRA’s parallel low-rank injection, several adapter patterns exist.

20.1 Serial Adapters (Houlsby-style)

Architecture: Insert adapter modules after each transformer sub-layer.

\[h_1 = \text{Attention}(x) + x, \quad h_2 = \text{Adapter}(h_1) + h_1, \quad h_3 = \text{FFN}(h_2) + h_2, \quad h_4 = \text{Adapter}(h_3) + h_3\]

Adapter Module: \[\text{Adapter}(h) = W_{\text{up}} \cdot \sigma(W_{\text{down}} \cdot h)\] where \(W_{\text{down}} \in \mathbb{R}^{d \times r}\), \(W_{\text{up}} \in \mathbb{R}^{r \times d}\), \(r \ll d\) (bottleneck).

Note

Adapters vs LoRA: Both are Low-Rank Bottlenecks


Mapping:

  Component                       Adapter                 LoRA
  Down-projection (\(d \to r\))   \(W_{\text{down}}\)     \(A\)
  Up-projection (\(r \to d\))     \(W_{\text{up}}\)       \(B\)
  Non-linearity                   Yes (\(\sigma\))        No
  Placement                       Serial (after layer)    Parallel (with layer)
  Merge at inference?             No (sequential)         Yes (\(W' = W + BA\))

Both use \(2 \times d \times r\) trainable parameters per module. LoRA’s lack of non-linearity is offset by ability to merge weights, eliminating inference overhead.

Pros/Cons:

  • Pros: More expressive (non-linear via \(\sigma\)), good for multi-task learning

  • Cons: Adds latency (sequential), 2× more adapter modules than LoRA

20.2 Parallel Adapters (LoRA-style)

Architecture: Add adapter output in parallel with original layer.

\(h = W x + \text{Adapter}(x) + x\)

For LoRA: \(\text{Adapter}(x) = \frac{\alpha}{r} BA x\)

Pros/Cons:

  • Pros: No latency increase (fused computation), simpler training

  • Cons: Linear only (no non-linearity), slightly less expressive

20.3 Adapter Fusion

Problem: After training task-specific adapters, how to combine them for multi-task inference?

Naive: Switch adapters per task (requires model reload).

Fusion: Learn weighted combination of adapters with small fusion layer: \[h = Wx + \sum_{i=1}^{N} \alpha_i \cdot \text{Adapter}_i(x)\] where \(\alpha_i\) learned via attention over task embeddings.

Use Case: Multi-task serving where model handles multiple domains (e.g., chatbot with medical/legal/general knowledge).
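A toy sketch of the fusion forward pass (names are hypothetical; the fusion logits stand in for the attention-over-task-embeddings module, and adapter outputs are precomputed vectors):

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def fused_forward(base_out, adapter_outs, fusion_logits):
    """h = base + sum_i alpha_i * adapter_i(x), with alpha = softmax(logits)."""
    alphas = softmax(fusion_logits)
    return [
        b + sum(a * out[j] for a, out in zip(alphas, adapter_outs))
        for j, b in enumerate(base_out)
    ]

base = [1.0, 0.0]                       # Wx for one token
adapters = [[0.2, -0.1], [0.0, 0.4]]    # outputs of two task adapters
h = fused_forward(base, adapters, fusion_logits=[0.0, 0.0])  # equal weights
```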

21 Compute Optimizations

21.1 Merged vs Dynamic Adapters

Merged Adapters (Deployment):

  • After training, compute \(W' = W + BA\) and save single checkpoint

  • Pros: No inference overhead, same speed as base model

  • Cons: Cannot switch tasks dynamically, requires separate checkpoint per task

Implementation:

# Merge LoRA into base model
W_merged = W_base + (lora_B @ lora_A) * (alpha / rank)

Dynamic Adapters (Multi-Task Serving):

  • Keep base model \(W\) frozen, compute \(Wx + BAx\) at runtime

  • Pros: Switch adapters per request (load \(B, A\) from disk), single base model for all tasks

  • Cons: \(\sim\)10-20% latency overhead for separate matrix multiplies
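The two modes compute the same function, which a small pure-Python check makes explicit (toy matrices; `matvec`/`matmul` are ad-hoc helpers):

```python
def matmul(P, Q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

W = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5], [-1.0]]          # d x r, with r = 1
A = [[2.0, 0.0]]             # r x k
alpha, r = 2, 1
x = [1.0, 1.0]

# Dynamic: keep W frozen, add (alpha/r) * B(Ax) at runtime.
Ax = matvec(A, x)
dyn = [w + (alpha / r) * b for w, b in zip(matvec(W, x), matvec(B, Ax))]

# Merged: fold the adapter into the weights once, then a single matvec.
scaled_BA = [[(alpha / r) * v for v in row] for row in matmul(B, A)]
W_merged = [[w + d_ for w, d_ in zip(rw, rd)] for rw, rd in zip(W, scaled_BA)]
mrg = matvec(W_merged, x)

assert all(abs(a - b) < 1e-9 for a, b in zip(dyn, mrg))
```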

21.2 Batched Multi-Adapter Inference

Problem: Batch contains requests for different tasks (different adapters).

Naive: Run each request separately (no batching benefit).

Optimized (S-LoRA): Compute base model forward pass once, then apply task-specific adapters:

\(H_{\text{base}} = W X\) (one GEMM over the full batch), then for the rows \(X_t\) belonging to task \(t\): \(H_t = H_{\text{base},t} + B_t A_t X_t\)

Memory Management: Keep hot adapters in GPU memory, swap cold adapters to CPU/disk.

Production Example: vLLM + LoRAX supports batched multi-adapter inference with \(<\)5% throughput degradation vs single-task.
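A toy sketch of the serving pattern (task ids, adapter entries, and the helper names are illustrative; a real system batches the base pass into one GEMM and gathers adapters by request):

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def batched_lora_forward(W, batch, adapters):
    """Shared base projection plus a per-request low-rank delta.

    `batch` is a list of (task_id, x); `adapters` maps
    task_id -> (B, A, r, alpha). Hot adapters stay in GPU memory.
    """
    H_base = [matvec(W, x) for _, x in batch]  # shared base computation
    out = []
    for h, (task, x) in zip(H_base, batch):
        B, A, r, alpha = adapters[task]
        delta = matvec(B, matvec(A, x))        # task-specific delta
        out.append([hi + (alpha / r) * d for hi, d in zip(h, delta)])
    return out

W = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "legal":   ([[1.0], [0.0]], [[1.0, 1.0]], 1, 1.0),
    "medical": ([[0.0], [1.0]], [[2.0, 0.0]], 1, 1.0),
}
batch = [("legal", [1.0, 2.0]), ("medical", [1.0, 2.0])]
outs = batched_lora_forward(W, batch, adapters)
```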

21.3 Fused Kernels for LoRA

Standard Approach:

  1. Compute \(y_1 = Wx\) (base GEMM)

  2. Compute \(y_2 = Ax\) (adapter GEMM)

  3. Compute \(y_3 = By_2\) (adapter GEMM)

  4. Add: \(y = y_1 + \frac{\alpha}{r} y_3\)

Total: 3 kernel launches, poor memory locality.

Fused Kernel:

  • Single kernel computes \(y = Wx + \frac{\alpha}{r} B(Ax)\)

  • Speedup: 20-40% faster than naive implementation

  • Available in: PEFT library, vLLM, TensorRT-LLM

21.4 Training Speedups

Gradient Checkpointing: Even with LoRA, activations dominate memory. Checkpointing reduces memory by 3-5×.

Mixed Precision:

  • Base model: FP16/BF16 (or NF4 for QLoRA)

  • Adapters: FP32 (higher precision for stability)

  • Gradients: FP16 (reduce memory)

DeepSpeed ZeRO: Shard optimizer states across GPUs (Stage 2) or all parameters (Stage 3). Enables LoRA training of 70B+ models on 8× A100.

22 Synthetic Data for Fine-Tuning

22.1 When to Use Synthetic Data

Scenarios:

  • Limited labeled data (\(<\)100 examples)

  • Domain-specific task with no public dataset (e.g., company-internal QA)

  • Data augmentation for low-resource tasks

  • Bootstrapping for instruction tuning

22.2 Generation Methods

22.2.1 Teacher Model Sampling

Same as distillation (nucleus sampling, \(p=0.9\)):

  1. Curate seed prompts (100-1K examples covering task distribution)

  2. Generate completions from larger teacher model

  3. Filter by quality: perplexity \(<\) threshold, length in range, no toxicity

  4. Fine-tune student on synthetic (prompt, completion) pairs

Production Example: Alpaca (52K instruction-following examples generated from OpenAI's text-davinci-003 using 175 seed tasks).

22.2.2 Self-Instruct

Idea: Use model to generate its own training data iteratively.

  1. Initialize with a small seed set (e.g., 50 examples)

  2. Sample \(n\) seed examples from the current dataset

  3. Prompt the model to generate new instructions

  4. Generate outputs for the new instructions

  5. Filter for quality and diversity

  6. Add to the dataset and fine-tune
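The loop can be sketched with a stub standing in for the teacher model (all names are hypothetical; a real run would call an LLM and apply much stronger quality and diversity filters):

```python
import random

def self_instruct(seed, generate, quality_ok, rounds=3, n_sample=2):
    """Iterative Self-Instruct: sample seeds, generate, filter, grow."""
    dataset = list(seed)
    for _ in range(rounds):
        prompts = random.sample(dataset, min(n_sample, len(dataset)))
        candidates = [generate(p) for p in prompts]      # model-written instructions
        fresh = [c for c in candidates if quality_ok(c) and c not in dataset]
        dataset.extend(fresh)                            # then fine-tune on the pool
    return dataset

grown = self_instruct(
    seed=["Summarize this article."],
    generate=lambda p: "Variant: " + p,   # stub "teacher model"
    quality_ok=lambda ex: len(ex) > 10,   # stand-in quality filter
)
```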

Challenges:

  • Quality degrades over iterations (model amplifies its own mistakes)

  • Requires strong base model (GPT-3.5+ level)

  • Need diversity filters (avoid redundant examples)

22.2.3 Evol-Instruct

Idea: Iteratively increase complexity of instructions.

Complexity Operations:

  • Add constraints (e.g., "in 100 words", "without using the letter ‘e’")

  • Increase reasoning steps (multi-hop questions)

  • Add domain knowledge requirements

  • Combine multiple skills (summarize + translate)

Example:

  • Seed: "Summarize this article."

  • Evolved: "Summarize this medical research article in layman’s terms, focusing on clinical implications, in under 150 words."

Used in WizardLM, WizardCoder.

22.3 Quality Control

Filters:

  • Perplexity: Reject examples with \(\text{PPL} > 100\) (likely gibberish)

  • Length: Filter too short (\(<\)20 tokens) or too long (\(>\)2K tokens)

  • Diversity: Use embedding clustering, discard near-duplicates

  • Toxicity: Run Perspective API or toxicity classifier

  • Task adherence: Prompt-based validation (does output follow instruction?)
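The perplexity and length filters can be sketched directly (the field names `ppl` and `tokens` are assumptions; diversity and toxicity checks require external models and are omitted):

```python
def passes_filters(example, max_ppl=100.0, min_tokens=20, max_tokens=2000):
    """Keep an example only if its perplexity and token length are in range."""
    n = len(example["tokens"])
    return example["ppl"] <= max_ppl and min_tokens <= n <= max_tokens

data = [
    {"ppl": 35.0, "tokens": ["x"] * 120},   # good
    {"ppl": 240.0, "tokens": ["x"] * 120},  # likely gibberish: ppl too high
    {"ppl": 20.0, "tokens": ["x"] * 5},     # too short
]
kept = [ex for ex in data if passes_filters(ex)]
print(len(kept))  # -> 1
```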

Human-in-the-Loop:

  • Sample 5-10% of synthetic data for manual review

  • Identify systematic errors (e.g., model always refuses certain instructions)

  • Refine generation prompts based on failure modes

23 End-to-End: Fine-Tuning Qwen-Coder for Custom Repository

23.1 Use Case & Goals

Scenario: You have a proprietary codebase (e.g., internal Python framework with custom APIs, naming conventions, architectural patterns) and want to adapt Qwen2.5-Coder-7B to generate code following your conventions.

Goals:

  • Generate code using custom APIs (not seen during pretraining)

  • Follow internal naming conventions and style guides

  • Handle repository-specific patterns (e.g., config managers, logging utilities)

  • Maintain general coding ability without catastrophic forgetting

23.2 Step 1: Tokenization Analysis

Check Vocabulary Coverage:

Qwen-Coder uses a large vocabulary (\(\sim\)152k tokens) trained on code corpora. However, your custom APIs may not be well-represented.

  1. Extract custom identifiers: Collect function names, class names, variables from your codebase:

        # identifiers.py
        from transformers import AutoTokenizer
        import re
    
        tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")
    
        # Extract identifiers from code
        with open("your_repo/core/api.py") as f:
            code = f.read()
        identifiers = re.findall(r'\b[A-Za-z_][A-Za-z0-9_]*\b', code)
    
        # Check tokenization
        for name in set(identifiers):
            tokens = tokenizer.tokenize(name)
            if len(tokens) > 1:
                print(f"{name} -> {tokens}  # Fragmented!")
  2. Decision: If \(>\)20% of identifiers fragment into 3+ tokens, consider adding custom tokens to vocabulary. Otherwise, proceed without modification (Qwen-Coder handles long identifiers reasonably).

  3. Vocabulary extension (optional):

        # Add custom tokens (use sparingly!)
        custom_tokens = ["CustomAPIClient", "InternalConfig", ...]
        tokenizer.add_tokens(custom_tokens)
        model.resize_token_embeddings(len(tokenizer))

    Warning: New token embeddings are randomly initialized–requires more training data to learn.

23.3 Step 2: Data Preparation

Dataset Construction:

  1. Extract repository snippets:

    • Parse Python files, extract functions/classes with docstrings

    • Create pairs: (docstring \(\rightarrow\) implementation)

    • Filter: remove trivial functions (\(<\)5 lines), keep high-quality comments

  2. Synthetic pair generation: Use GPT-4 or Claude to generate instruction-code pairs:

        # Prompt template
        You are documenting a Python codebase. Given this function:
    
        ```python
        {function_code}
        ```
        Generate:
        1. A natural language instruction requesting this function
        2. A docstring explaining its purpose
        3. Example usage
  3. Format for instruction tuning:

        # dataset.jsonl (ChatML format for Qwen)
        {
          "messages": [
            {"role": "system", "content": "You are a code assistant..."},
            {"role": "user", "content": "Write a function to load config..."},
            {"role": "assistant", "content": "```python\n{code}\n```"}
          ]
        }
  4. Mix with general code data: Include 20-30% open-source examples (HumanEval, MBPP) to prevent forgetting.

Target Dataset Size:

  • Minimum: 500 high-quality pairs

  • Recommended: 2,000-5,000 pairs (mix of real + synthetic)

  • Maximum: 10,000+ if available (diminishing returns)

23.4 Step 3: LoRA Configuration

Why LoRA for Code Models:

  • Qwen2.5-Coder-7B has 7B params → full fine-tuning needs 114GB GPU memory

  • LoRA reduces to 12-24GB (fits on single A10/A100 40GB)

  • Preserves base model’s general coding ability

  • Enables multi-repository adapters (train separate LoRAs per codebase)

Recommended Hyperparameters:

  Parameter              Value                  Rationale
  Rank \(r\)             32                     Code generation needs higher capacity than classification
  Alpha \(\alpha\)       64                     \(\alpha = 2r\) for stable scaling
  Target modules         Q+K+V+O+FFN            Code requires reasoning (attention + feedforward)
  Dropout                0.05                   Light regularization
  Learning rate          \(3 \times 10^{-4}\)   Standard for LoRA
  Batch size             4-8                    Per-device, use gradient accumulation
  Gradient accum steps   4                      Effective batch = 16-32
  Epochs                 3-5                    Monitor validation, stop early
  Max seq length         2048                   Code context window

23.5 Step 4: Training with PEFT + DeepSpeed

Setup (HuggingFace PEFT):

# train.py
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
import torch

# Load model in 4-bit for QLoRA (optional)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B",
    load_in_4bit=True,  # Use QLoRA
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")
tokenizer.pad_token = tokenizer.eos_token

# LoRA config
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", 
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts

# Training arguments
training_args = TrainingArguments(
    output_dir="./qwen-coder-custom-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=3e-4,
    fp16=False,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    warmup_steps=50,
    lr_scheduler_type="cosine",
    max_grad_norm=1.0
)

# SFTTrainer for instruction tuning
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    max_seq_length=2048,
    packing=False  # Keep False for code (maintain context boundaries)
)

trainer.train()
trainer.save_model()

Expected Training Time:

  • Hardware: Single A100 40GB

  • Dataset: 5,000 examples, max length 2048

  • Time: \(\sim\)6-8 hours for 3 epochs

  • Memory: \(\sim\)24GB (QLoRA) or 40GB (16-bit LoRA)

23.6 Step 5: Evaluation

Metrics:

  1. Pass@k on custom test set:

    • Create 50-100 held-out instructions from your codebase

    • Generate \(k=10\) completions per instruction

    • Execute and verify correctness (unit tests)

    • Measure: fraction that pass tests

  2. HumanEval retention: Run HumanEval benchmark to ensure no catastrophic forgetting of general coding.

  3. Style adherence: Manual review–does generated code follow your conventions?

    • Naming (snake_case vs camelCase)

    • Import patterns (from mylib import X vs import mylib.X)

    • Error handling (custom exceptions)

Inference Example:

# inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./qwen-coder-custom-lora")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")

# Generate
messages = [
    {"role": "system", "content": "You are an expert in our codebase."},
    {"role": "user", "content": "Write a function to load config from YAML"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, 
                                      add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

23.7 Step 6: Deployment Strategies

Option 1: Merge LoRA into base model (single-task):

from peft import PeftModel

model = PeftModel.from_pretrained(base_model, "./qwen-coder-custom-lora")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./qwen-coder-merged")

Pros: Faster inference (no adapter overhead)
Cons: Cannot switch between repositories

Option 2: Multi-adapter serving (multi-task):

  • Keep base model in memory

  • Load LoRA adapters dynamically per request (LoRAX, vLLM)

  • Each repository gets its own adapter

  • Latency: \(<\)50ms to swap adapters

Option 3: Quantized deployment (edge):

# Quantize merged model to INT8/INT4
from optimum.quanto import quantize, qint4

quantize(merged_model, weights=qint4, activations=None)
merged_model.save_pretrained("./qwen-coder-int4")

23.8 Production Considerations

  • Data pipeline: Automate scraping new code weekly, retrain LoRA monthly

  • Version control: Tag adapters with repo commits (lora-v1.2.3)

  • A/B testing: Serve base model vs LoRA to measure quality improvement

  • Monitoring: Log generated code → manual review → feedback loop

  • Compliance: Ensure training data doesn’t leak proprietary secrets (filter credentials, keys)

  • Multi-repo scaling: Train separate adapters per microservice/team

24 Best Practices & Common Pitfalls

24.1 Best Practices

General:

  • Start small: Try LoRA with \(r=8\) on Q+V before scaling up

  • Monitor validation: Stop early if validation loss plateaus or increases

  • Save checkpoints: Save every epoch–best checkpoint often not the last

  • Ablate hyperparameters: Test \(r \in \{8, 16, 32\}\) and LR \(\in \{10^{-4}, 3 \times 10^{-4}, 10^{-3}\}\)

Dataset-Specific:

  • Small data (\(<\)1K): Low rank (\(r=4\)-\(8\)), more regularization (dropout \(0.1\)), more epochs (10-20)

  • Medium data (1K-10K): Standard recipe (\(r=16\), LR \(3 \times 10^{-4}\), 5-10 epochs)

  • Large data (\(>\)10K): Higher rank (\(r=32\)-\(64\)), consider full fine-tuning if compute allows

Task-Specific:

  • Classification/NER: Q+V sufficient, low rank

  • Generation (summarization, translation): Q+K+V+O, medium rank

  • Complex reasoning (math, code): Attention + FFN, high rank (\(r=32\)-\(64\))

24.2 Common Pitfalls

  • Rank too high: Overfits on small data, wasted memory

  • Learning rate too high: Catastrophic forgetting, loss spikes

  • Training too long: Overfitting, validation loss increases after epoch 3-5

  • Ignoring data quality: Synthetic data with errors propagates to model

  • Not freezing embeddings: Wastes memory, destabilizes rare tokens

  • Merging adapters prematurely: Test dynamic inference first (may need multi-task)

24.3 Production Recipes

Llama-2-7B Fine-Tuning (LoRA):

  • Layers: Q+V, \(r=16\), \(\alpha=16\)

  • LR: \(3 \times 10^{-4}\), warmup 3%, cosine decay

  • Batch size: 32 (gradient accumulation 4), epochs: 5

  • Memory: 35GB (A100 40GB)

Llama-2-13B Fine-Tuning (QLoRA):

  • Base: NF4 + double quantization, Adapters: Q+K+V+O, \(r=32\)

  • LR: \(1 \times 10^{-4}\), warmup 5%, linear decay

  • Batch size: 16, epochs: 10

  • Memory: 18GB (RTX 4090 24GB)

Multi-Task Adapter Serving:

  • Base model in GPU memory (14GB for 7B)

  • Hot adapters in GPU (top 10, \(\sim\)100MB each)

  • Cold adapters on CPU/disk (swap on demand, \(<\)50ms latency)

  • Use vLLM + LoRAX for batched inference

25 Interview Questions

Note

Q: Why does LoRA work?
A: Fine-tuning updates have low intrinsic rank–most information in top few singular values. Full-rank updates are redundant.

Q: LoRA vs QLoRA?
A: LoRA uses FP16 base (35GB for 7B), QLoRA uses NF4 base (12GB). QLoRA adds \(\sim\)1-2% quality loss but enables larger models on consumer GPUs.

Q: When to use full fine-tuning vs LoRA?
A: Full FT for domain shift + large data + sufficient compute. LoRA for limited compute, small data, multi-task serving.

Q: How to choose rank \(r\)?
A: Start with \(r=16\). Ablate \(\{8, 16, 32\}\) based on validation. Complex tasks (code, math) benefit from \(r=32\)-\(64\). Small data risks overfitting with high \(r\).

Q: Which layers for LoRA?
A: Q+V default, Q+K+V+O for better quality, attention+FFN for max capacity. Always freeze embeddings.

Q: Merged vs dynamic adapters?
A: Merge for single-task deployment (no overhead). Keep dynamic for multi-task serving (10-20% latency cost but flexible).

Q: How to generate synthetic data?
A: Nucleus sampling (\(p=0.9\)) from teacher model. Filter by perplexity, length, toxicity. Human review 5-10%. Examples: Alpaca (52K examples from text-davinci-003), WizardLM (Evol-Instruct).

26 Summary

Fine-Tuning Hierarchy:

  1. Full Fine-Tuning: Update all params, highest quality, 114GB for 7B model

  2. LoRA: Low-rank adapters, 35GB for 7B, 256× param reduction, \(<\)1% quality loss

  3. QLoRA: 4-bit base + LoRA, 12GB for 7B, 1-2% quality loss, consumer GPU friendly

Key Hyperparameters:

  • Rank: \(r=16\) default, ablate \(\{8, 16, 32\}\)

  • Layers: Q+V (default), Q+K+V+O (better), attention+FFN (max)

  • LR: \(3 \times 10^{-4}\) for LoRA, \(5 \times 10^{-5}\) for full FT

  • Epochs: 5-10 for LoRA, 1-3 for full FT

Deployment Strategies:

  • Single-task: Merge adapters into base model (no overhead)

  • Multi-task: Dynamic adapters with batched inference (S-LoRA, vLLM)

  • Extreme memory: QLoRA + gradient checkpointing + DeepSpeed ZeRO

Synthetic Data:

  • Teacher sampling with nucleus (\(p=0.9\)) for diversity

  • Filter by perplexity, length, toxicity, diversity

  • Self-Instruct / Evol-Instruct for bootstrapping

  • Mix 70-90% synthetic with 10-30% real data