# Fine-Tuning Guide
This guide covers parameter-efficient fine-tuning methods available in the LLM framework.
## Overview
| Method | Memory (relative) | Use Case |
|---|---|---|
| Full Fine-Tuning | 100% | Small models, ample GPU memory |
| LoRA | ~10% | General PEFT, good balance |
| QLoRA | ~5% | Large models, limited GPU memory |
## LoRA (Low-Rank Adaptation)

LoRA freezes the base model and adds trainable low-rank matrices alongside selected linear layers, reducing the number of trainable parameters by well over 90% (often more than 99%).
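Concretely, each adapted layer keeps its original weight W0 frozen and learns a rank-r update, so the layer computes y = W0·x + (alpha/r)·B(A·x). The snippet below is an illustrative, self-contained sketch of that idea in plain PyTorch; `LoRALinearSketch` is our own name for illustration, not the framework's implementation.

```python
import torch
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    """Illustrative only: a frozen base Linear plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze W0 (and bias)
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init => no change at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W0 x + (alpha / r) * B (A x)
        return self.base(x) + self.scaling * ((x @ self.lora_A.T) @ self.lora_B.T)

layer = LoRALinearSketch(nn.Linear(768, 768), rank=8, alpha=16.0)
x = torch.randn(2, 10, 768)
print(layer(x).shape)  # torch.Size([2, 10, 768]); only lora_A/lora_B receive gradients
```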
### Basic Usage

```python
import torch
from llm.models import DecoderModel
from llm.core.lora import apply_lora, get_lora_parameters, merge_lora

# 1. Create/load model
model = DecoderModel(vocab_size=32000, hidden_size=768, num_layers=12)

# 2. Apply LoRA
apply_lora(
    model,
    rank=8,                                   # Low-rank dimension
    alpha=16.0,                               # Scaling factor
    dropout=0.1,                              # Regularization
    target_modules=["qkv_proj", "out_proj"],  # Which layers to adapt
)

# 3. Train with LoRA parameters only
optimizer = torch.optim.AdamW(get_lora_parameters(model), lr=1e-4)

# 4. For inference: merge weights
merge_lora(model)  # LoRA weights merged into base, no extra latency
```
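A quick way to confirm that `apply_lora` froze the base model is to count trainable parameters (plain PyTorch, not a framework API). The exact fraction depends on the model and rank, but it should be a small single-digit percentage or less.

```python
# Sanity check: only the LoRA matrices should require gradients after apply_lora.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```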
### Configuration Tips

| Parameter | Recommendation |
|---|---|
| `rank` | 4-16 for most tasks; higher for complex tasks |
| `alpha` | Usually 2x rank (e.g., rank=8 → alpha=16) |
| `target_modules` | QKV + output projections in attention |
## QLoRA (Quantized LoRA)

QLoRA quantizes the frozen base weights to 4-bit NF4 and trains LoRA adapters on top of them, pushing memory requirements down much further.
### Basic Usage

```python
import torch
from llm.core.qlora import apply_qlora, get_qlora_parameters

# Apply QLoRA (base weights quantized to 4-bit NF4)
apply_qlora(
    model,
    rank=8,
    alpha=16.0,
    block_size=64,  # Quantization block size
)

# Train
optimizer = torch.optim.AdamW(get_qlora_parameters(model), lr=1e-4)
```
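If the model sits on a GPU, a rough before/after check of allocated memory makes the effect of 4-bit quantization visible. This is plain PyTorch, not a framework API, and the absolute numbers depend on your model and device.

```python
# Print this both before and after apply_qlora to see the drop in weight memory.
if torch.cuda.is_available():
    print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB allocated on GPU")
```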
### Memory Comparison

For a 7B-parameter model (approximate weight figures; activation memory and optimizer state add further overhead, especially for full fine-tuning):
| Method | Base Weights | Trainable | Total VRAM |
|---|---|---|---|
| Full FT | 14GB (fp16) | 14GB | ~28GB |
| LoRA | 14GB (fp16) | 0.1GB | ~14GB |
| QLoRA | 3.5GB (4-bit) | 0.1GB | ~4GB |
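The weight figures in the table are just bytes-per-parameter arithmetic; the ~0.1 GB trainable figure corresponds to roughly 50M adapter parameters in fp16, which varies with rank and target modules.

```python
params = 7e9              # 7B parameters
print(params * 2 / 1e9)   # fp16: 2 bytes/param -> 14.0 GB base weights
print(params * 0.5 / 1e9) # NF4: 4 bits/param   -> 3.5 GB (plus small per-block scales)
print(50e6 * 2 / 1e9)     # ~50M LoRA params in fp16 -> ~0.1 GB trainable
```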
### How NF4 Quantization Works

```mermaid
graph LR
    A[FP16 Weights] --> B[Block-wise Normalization]
    B --> C[Map to NF4 Levels]
    C --> D[4-bit Indices + Scales]
    D --> E[Dequantize on Forward]
    E --> F[FP16 for Compute]
```
NF4 (Normal Float 4-bit) uses 16 quantization levels placed at quantiles of a standard normal distribution, which closely matches the distribution of pretrained weights.
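The diagram maps directly onto a few lines of tensor code. The sketch below is a standalone illustration of the idea using the published NF4 level table; the function names are ours, not the framework's `quantize_nf4`/`dequantize_nf4`.

```python
import torch

# The 16 NF4 levels (quantiles of a standard normal, rounded; see the QLoRA paper).
NF4_LEVELS = torch.tensor([
    -1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
     0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230,  1.0000,
])

def nf4_quantize_sketch(weight: torch.Tensor, block_size: int = 64):
    """Block-wise NF4: one per-block absmax scale plus a 4-bit level index per weight.

    Assumes weight.numel() is a multiple of block_size (real implementations pad),
    and keeps indices as uint8 for clarity (real implementations pack two per byte).
    """
    blocks = weight.flatten().float().view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-8)
    normalized = blocks / scales                                   # now in [-1, 1]
    indices = (normalized.unsqueeze(-1) - NF4_LEVELS).abs().argmin(dim=-1)
    return indices.to(torch.uint8), scales

def nf4_dequantize_sketch(indices, scales, shape):
    """Look up the levels and rescale back to the original shape."""
    return (NF4_LEVELS[indices.long()] * scales).view(shape)

w = torch.randn(128, 256)
idx, scales = nf4_quantize_sketch(w)
w_hat = nf4_dequantize_sketch(idx, scales, w.shape)
print(f"mean abs error: {(w - w_hat).abs().mean():.4f}")  # small but nonzero
```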
## Best Practices

### 1. Choose the Right Method

- LoRA: when you have 1-2 GPUs with 16-24GB VRAM
- QLoRA: when memory is severely limited (e.g., an 8GB GPU) or the model is very large (13B+ parameters)
### 2. Target Module Selection

For transformer models, prioritize the following (a quick way to list candidate module names is shown after this list):

- `qkv_proj` (or separate `q_proj`, `k_proj`, `v_proj`): attention queries/keys/values
- `out_proj`: attention output
- Linear layers in the MLP (optional, diminishing returns)
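Module names differ between model implementations, so it helps to inspect the model before choosing `target_modules`. This is plain PyTorch inspection, not a framework API:

```python
import torch.nn as nn

# Print every Linear layer's qualified name and weight shape to pick target_modules from.
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        print(name, tuple(module.weight.shape))
```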
### 3. Hyperparameters

```python
# Alternative starting points (apply one of these to a freshly loaded model,
# not all three in sequence):

# Conservative start
apply_lora(model, rank=4, alpha=8)

# More capacity if underfitting
apply_lora(model, rank=16, alpha=32)

# With regularization for small datasets
apply_lora(model, rank=8, alpha=16, dropout=0.1)
```
### 4. Saving and Loading

```python
# Save only the LoRA weights (small file)
lora_state = {name: p.detach().cpu() for name, p in model.named_parameters() if p.requires_grad}
torch.save(lora_state, "lora_weights.pt")

# Load: first apply_lora, then load the weights
apply_lora(model, rank=8, alpha=16)
model.load_state_dict(torch.load("lora_weights.pt"), strict=False)
```
## API Reference

### LoRA Functions

| Function | Description |
|---|---|
| `apply_lora(model, ...)` | Apply LoRA to model |
| `merge_lora(model)` | Merge LoRA into base weights |
| `unmerge_lora(model)` | Undo merge |
| `get_lora_parameters(model)` | Get trainable params |
| `disable_lora(model)` | Temporarily disable |
| `enable_lora(model)` | Re-enable |
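A typical use of the toggling and merging helpers, assuming they are importable from `llm.core.lora` alongside `apply_lora` (only the signatures shown in the table are used here):

```python
from llm.core.lora import disable_lora, enable_lora, merge_lora, unmerge_lora

disable_lora(model)   # temporarily bypass adapters, e.g. to compare against the base model
enable_lora(model)    # turn the adapters back on

merge_lora(model)     # fold adapters into the base weights before deployment
unmerge_lora(model)   # undo the merge if further training is needed
```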
### QLoRA Functions

| Function | Description |
|---|---|
| `apply_qlora(model, ...)` | Apply QLoRA (quantizes base) |
| `get_qlora_parameters(model)` | Get trainable params |
| `quantize_nf4(tensor)` | Manual NF4 quantization |
| `dequantize_nf4(indices, scales, ...)` | Dequantize NF4 |