Fine-Tuning Guide

This guide covers parameter-efficient fine-tuning methods available in the LLM framework.

Overview

| Method | Memory | Use Case |
|---|---|---|
| Full Fine-Tuning | 100% | Small models, abundant GPU memory |
| LoRA | ~10% | General PEFT, good balance |
| QLoRA | ~5% | Large models, limited GPU memory |

LoRA (Low-Rank Adaptation)

LoRA adds trainable low-rank matrices to frozen linear layers, reducing trainable parameters by 90%+.
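Conceptually, each adapted linear layer computes y = W x + (alpha / rank) · B A x, where W is the frozen base weight and A, B are the small trainable matrices. A minimal sketch of the idea in plain PyTorch (illustrative only; this is not the framework's internal implementation):

import torch
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    """Illustrative LoRA wrapper around a frozen nn.Linear (not the framework's class)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # freeze the pretrained weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x):
        # frozen path + scaled low-rank update
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)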

Basic Usage

import torch

from llm.models import DecoderModel
from llm.core.lora import apply_lora, get_lora_parameters, merge_lora

# 1. Create/load model
model = DecoderModel(vocab_size=32000, hidden_size=768, num_layers=12)

# 2. Apply LoRA
apply_lora(
    model,
    rank=8,           # Low-rank dimension
    alpha=16.0,       # Scaling factor
    dropout=0.1,      # Regularization
    target_modules=["qkv_proj", "out_proj"],  # Which layers to adapt
)

# 3. Train with LoRA parameters only
optimizer = torch.optim.AdamW(get_lora_parameters(model), lr=1e-4)

# 4. For inference: merge weights
merge_lora(model)  # LoRA weights merged into base, no extra latency
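To flesh out step 3, a single training step could look like the following. The batch contents and loss are illustrative assumptions, and the model is assumed to return next-token logits of shape (batch, seq, vocab); adapt to your data pipeline and the model's actual forward signature:

import torch.nn.functional as F

input_ids = torch.randint(0, 32000, (4, 128))       # hypothetical batch of token ids
logits = model(input_ids)                           # assumed: (batch, seq, vocab) logits
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),    # predictions for positions 0..n-2
    input_ids[:, 1:].reshape(-1),                   # targets are the next tokens
)
loss.backward()        # only the LoRA parameters receive gradients
optimizer.step()
optimizer.zero_grad()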

Configuration Tips

| Parameter | Recommendation |
|---|---|
| rank | 4-16 for most tasks, higher for complex tasks |
| alpha | Usually 2x rank (e.g., rank=8 → alpha=16) |
| target_modules | QKV + output projections in attention |
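The rationale for "alpha ≈ 2x rank": in the standard LoRA formulation the update is scaled by alpha / rank (as in the sketch earlier), so tying alpha to rank keeps that effective scale constant as you change rank:

for rank in (4, 8, 16):
    alpha = 2 * rank
    print(f"rank={rank:2d}  alpha={alpha:2d}  scaling={alpha / rank:.1f}")  # scaling stays 2.0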

QLoRA (Quantized LoRA)

QLoRA combines 4-bit quantization with LoRA for extreme memory efficiency.

Basic Usage

import torch

from llm.core.qlora import apply_qlora, get_qlora_parameters

# Apply QLoRA to a base model created/loaded as in the LoRA example above
# (the frozen base weights are quantized to 4-bit NF4)
apply_qlora(
    model,
    rank=8,
    alpha=16.0,
    block_size=64,    # Quantization block size
)

# Train
optimizer = torch.optim.AdamW(get_qlora_parameters(model), lr=1e-4)
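After applying QLoRA it is worth sanity-checking that only the adapter parameters are trainable. A generic PyTorch check (not a framework API):

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")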

Memory Comparison

For a 7B parameter model:

| Method | Base Weights | Trainable | Total VRAM |
|---|---|---|---|
| Full FT | 14GB (fp16) | 14GB | ~28GB |
| LoRA | 14GB (fp16) | 0.1GB | ~14GB |
| QLoRA | 3.5GB (4-bit) | 0.1GB | ~4GB |
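The base-weight figures follow directly from bytes per parameter; a quick sanity check (optimizer state, activations, and framework overhead come on top of these):

params = 7e9
print(f"fp16 base: {params * 2 / 1e9:.1f} GB")    # 2 bytes per parameter -> ~14 GB
print(f"NF4 base:  {params * 0.5 / 1e9:.1f} GB")  # 4 bits per parameter  -> ~3.5 GB (+ small per-block scales)
# Rank-8 LoRA adapters on the attention projections typically add on the order of
# tens of millions of parameters, i.e. roughly 0.1 GB in fp16.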

How NF4 Quantization Works

FP16 weights → block-wise normalization → map to NF4 levels → 4-bit indices + per-block scales

4-bit indices + per-block scales → dequantize on forward → FP16 for compute

NF4 (Normal Float 4-bit) uses 16 carefully chosen quantization levels optimized for normally distributed weights.
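A simplified sketch of block-wise NF4 quantization in plain PyTorch. The 16 levels are the (rounded) values from the QLoRA paper; this illustrates the idea only and is not the framework's quantize_nf4/dequantize_nf4 implementation:

import torch

# 16 NF4 levels: quantiles of a standard normal, rescaled to [-1, 1] (rounded)
NF4_LEVELS = torch.tensor([
    -1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
     0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230,  1.0000,
])

def quantize_nf4_sketch(w: torch.Tensor, block_size: int = 64):
    blocks = w.reshape(-1, block_size)                # assumes numel divisible by block_size
    scales = blocks.abs().max(dim=1, keepdim=True).values.clamp_min(1e-12)  # absmax per block
    normed = blocks / scales                          # now in [-1, 1]
    indices = (normed.unsqueeze(-1) - NF4_LEVELS).abs().argmin(dim=-1)      # nearest NF4 level
    return indices.to(torch.uint8), scales            # 4-bit codes (stored as uint8 here) + fp scales

def dequantize_nf4_sketch(indices, scales, shape):
    return (NF4_LEVELS[indices.long()] * scales).reshape(shape)

w = torch.randn(768, 768)
idx, s = quantize_nf4_sketch(w)
w_hat = dequantize_nf4_sketch(idx, s, w.shape)        # close to w, at roughly a quarter of the storage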


Best Practices

1. Choose the Right Method

  • LoRA: When you have 1-2 GPUs with 16-24GB VRAM
  • QLoRA: When memory is severely limited (e.g., an 8GB GPU) or the model is very large (13B+ parameters)

2. Target Module Selection

For transformer models, prioritize (module names can be discovered with the snippet after this list):

  1. qkv_proj / q_proj, k_proj, v_proj (attention queries/keys/values)
  2. out_proj (attention output)
  3. Linear layers in MLP (optional, diminishing returns)
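To find valid target_modules strings for a given model, list its linear submodules and reuse the trailing attribute names (generic PyTorch introspection, not a framework API):

import torch.nn as nn

linear_names = {name.split(".")[-1]
                for name, module in model.named_modules()
                if isinstance(module, nn.Linear)}
print(sorted(linear_names))   # e.g. ['out_proj', 'qkv_proj', ...] depending on the architecture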

3. Hyperparameters

# Conservative start
apply_lora(model, rank=4, alpha=8)

# More capacity if underfitting
apply_lora(model, rank=16, alpha=32)

# With regularization for small datasets
apply_lora(model, rank=8, alpha=16, dropout=0.1)

4. Saving and Loading

# Save only the LoRA weights (small file)
lora_state = {name: p.detach().cpu() for name, p in model.named_parameters() if p.requires_grad}
torch.save(lora_state, "lora_weights.pt")

# Load: apply_lora first (to create the LoRA modules), then load the weights
apply_lora(model, rank=8, alpha=16)
model.load_state_dict(torch.load("lora_weights.pt", map_location="cpu"), strict=False)
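If the adapted model will be served without the LoRA machinery, you can fold the adapters into the base weights after loading and save one ordinary checkpoint (file names here are just examples):

merge_lora(model)                                   # bake the LoRA update into the base weights
torch.save(model.state_dict(), "merged_model.pt")   # single checkpoint with the adapted weights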

API Reference

LoRA Functions

| Function | Description |
|---|---|
| apply_lora(model, ...) | Apply LoRA to model |
| merge_lora(model) | Merge LoRA into base weights |
| unmerge_lora(model) | Undo merge |
| get_lora_parameters(model) | Get trainable params |
| disable_lora(model) | Temporarily disable |
| enable_lora(model) | Re-enable |
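A typical use of the toggle helpers is comparing the adapted model against the frozen base model on the same batch. A sketch, assuming the import path matches the other LoRA helpers and that input_ids is a placeholder batch:

from llm.core.lora import disable_lora, enable_lora  # assumed location, matching the imports above

with torch.no_grad():
    disable_lora(model)                  # temporarily bypass the adapters
    base_logits = model(input_ids)
    enable_lora(model)                   # switch the adapters back on
    adapted_logits = model(input_ids)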

QLoRA Functions

| Function | Description |
|---|---|
| apply_qlora(model, ...) | Apply QLoRA (quantizes base) |
| get_qlora_parameters(model) | Get trainable params |
| quantize_nf4(tensor) | Manual NF4 quantization |
| dequantize_nf4(indices, scales, ...) | Dequantize NF4 |