GPT-2 (small) has 117M parameters. The smallest LLaMA-2 has 7B. GPT-3 has 175B.
Full fine-tuning means updating every single parameter. For GPT-2 that's manageable. For LLaMA-2 7B, the gradients alone take 28GB of GPU memory in fp32 (7B parameters × 4 bytes). For GPT-3 it's basically impossible without a cluster.
LoRA (Low-Rank Adaptation) solves this. Instead of updating the full weight matrices, it adds tiny trainable modules next to them. The original weights stay frozen. Only the tiny modules train. At the end you merge them back.
For a 7B model, that can mean going from a multi-GPU A100 node to a single consumer GPU. Small models can sometimes be adapted on just a CPU.
What You'll Learn Here
- Why full fine-tuning doesn't scale
- The math behind LoRA in plain English
- Rank, alpha, and dropout: what they control
- Which layers to apply LoRA to
- Setting up LoRA with HuggingFace PEFT
- QLoRA: quantization + LoRA for consumer hardware
- Merging LoRA weights for deployment
- Comparing LoRA to full fine-tuning
The Problem With Full Fine-Tuning at Scale
# Memory requirements for fine-tuning
def estimate_gpu_memory(n_params_billions, dtype='float32'):
    bytes_per_param = {
        'float32': 4,
        'float16': 2,
        'int8': 1,
        'int4': 0.5
    }
    bpp = bytes_per_param[dtype]
    model_gb = n_params_billions * 1e9 * bpp / 1e9
    # For full fine-tuning you also need:
    # - Gradients: same size as weights
    # - Adam optimizer states: 2x weight size
    # - Activations: depends on batch size (rough estimate 2x)
    total_gb = model_gb * (1 + 1 + 2 + 2)  # weights + grads + optimizer + activations
    return model_gb, total_gb

print(f"{'Model':<15} {'Params':<10} {'Weights':<12} {'Full FT Memory'}")
print("-" * 50)
for name, params in [('GPT-2', 0.117), ('LLaMA-7B', 7), ('LLaMA-13B', 13), ('GPT-3', 175)]:
    w_gb, total = estimate_gpu_memory(params, 'float32')
    print(f"{name:<15} {params:<10} {w_gb:.1f} GB {total:.0f} GB")
Output:
Model Params Weights Full FT Memory
--------------------------------------------------
GPT-2 0.117 0.5 GB 3 GB
LLaMA-7B 7 28.0 GB 168 GB
LLaMA-13B 13 52.0 GB 312 GB
GPT-3 175 700.0 GB 4200 GB
By this rough estimate, LLaMA-7B full fine-tuning needs about 168GB of GPU memory. A single A100 has 80GB, so you need at least 3 of them (168 / 80 ≈ 2.1, rounded up), at a cost of $30,000+.
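The GPU count follows from a ceiling division. Here is a minimal sketch of that arithmetic; `gpus_needed` is a hypothetical helper, and it assumes memory can be split perfectly evenly across devices, which real sharding setups only approximate.

```python
import math

def gpus_needed(total_memory_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Smallest number of identical GPUs whose combined memory covers the total."""
    return math.ceil(total_memory_gb / gpu_memory_gb)

# 7B parameters in fp32, using the 6x multiplier from estimate_gpu_memory above
weights_gb = 7 * 4          # 28 GB of weights
total_gb = weights_gb * 6   # weights + grads + Adam states + activations
print(f"{total_gb} GB -> {gpus_needed(total_gb)} x 80GB GPUs")
```

In practice you would leave headroom for fragmentation and communication buffers, so treat the result as a lower bound.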
LoRA changes this dramatically.
How LoRA Works: The Math
A pretrained weight matrix W has shape (d_out, d_in). Full fine-tuning updates W directly:
W_new = W_pretrained + ΔW
ΔW has the same shape as W. That's the problem. It's huge.
LoRA's insight: the update ΔW doesn't need to have full rank. The LoRA paper's hypothesis, backed by its experiments, is that the weight changes that matter during fine-tuning lie in a low-dimensional subspace.
Instead of learning ΔW directly, LoRA approximates it as the product of two small matrices:
ΔW ≈ B × A
where:
A has shape (r, d_in) - projects down to rank r
B has shape (d_out, r) - projects back up to d_out
r << min(d_in, d_out)
During forward pass:
output = x @ W^T + x @ A^T @ B^T × (alpha/r)
= (pretrained part) + (LoRA part)
W stays frozen. Only A and B train. Total parameters: r * (d_in + d_out) instead of d_in * d_out.
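To make the forward-pass formula concrete, here is a small pure-Python sketch with toy dimensions. The helpers `matmul`, `transpose`, `lora_forward`, and `lora_param_count` are illustrative names, not part of any library. It checks that with B initialized to zero the LoRA term contributes nothing, and it evaluates the parameter formula for an assumed 4096x4096 projection.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(M):
    return [list(col) for col in zip(*M)]

def lora_forward(x, W, A, B, alpha, r):
    """Compute x @ W^T + (alpha / r) * x @ A^T @ B^T for a batch of row vectors x."""
    base = matmul(x, transpose(W))
    delta = matmul(matmul(x, transpose(A)), transpose(B))
    scale = alpha / r
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]

def lora_param_count(d_in, d_out, r):
    return r * (d_in + d_out)

random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # (d_out, d_in), frozen
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # (r, d_in)
B = [[0.0] * r for _ in range(d_out)]                                  # (d_out, r), zero init
x = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(3)]      # batch of 3 inputs

out = lora_forward(x, W, A, B, alpha, r)
base_out = matmul(x, transpose(W))
print(out == base_out)  # True: zero B means the pretrained output is unchanged

# Assumed 4096x4096 projection with r=8
print(f"{lora_param_count(4096, 4096, 8):,} LoRA params vs {4096 * 4096:,} in W")
```

The savings grow with matrix size: at toy dimensions the two counts are close, while at 4096x4096 the LoRA factors are well under 1% of W.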
import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    def __init__(self, original_layer, rank=8, alpha=16, dropout=0.1):
        super().__init__()
        self.original = original_layer
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        # Freeze the original layer
        for param in self.original.parameters():
            param.requires_grad = False
        # LoRA matrices A and B
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        self.dropout = nn.Dropout(dropout)
        # Initialize: A with Kaiming-uniform noise (the LoRA paper uses a Gaussian), B with zeros
        # B=0 makes the LoRA delta zero, so the wrapped layer starts out identical to the original
        nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x):
        # Original output (frozen)
        original_out = self.original(x)
        # LoRA delta
        lora_out = self.lora_B(self.lora_A(self.dropout(x))) * self.scaling
        return original_out + lora_out

    def parameter_count(self):
        original_params = sum(p.numel() for p in self.original.parameters())
        lora_params = sum(p.numel() for p in self.lora_A.parameters()) + \
                      sum(p.numel() for p in self.lora_B.parameters())
        return original_params, lora_params

# Test LoRA layer
original_linear = nn.Linear(768, 768) # typical BERT attention dimension
lora_linear = LoRALayer(original_linear, rank=8, alpha=16)
original_params, lora_params = lora_linear.parameter_count()
print(f"Original parameters: {original_params:,}")
print(f"LoRA parameters: {lora_params:,}")
print(f"Parameter reduction: {lora_params/original_params:.1%} of original")
x = torch.randn(2, 10, 768)
out = lora_linear(x)
print(f"\nInput shape: {x.shape}")
print(f"Output shape: {out.shape}")
Output:
Original parameters: 590,592
LoRA parameters: 12,288
Parameter reduction: 2.1% of original
Input shape: torch.Size([2, 10, 768])
Output shape: torch.Size([2, 10, 768])
12,288 parameters instead of 590,592. Same output shape. 2.1% of the original.
Rank, Alpha, and What They Control
import pandas as pd
# How rank affects parameter count for a 768x768 matrix
rows = []
for rank in [1, 2, 4, 8, 16, 32, 64]:
    d_in = d_out = 768
    original = d_in * d_out
    lora = rank * (d_in + d_out)
    rows.append({
        'Rank': rank,
        'LoRA params': lora,
        'Original params': original,
        '% of original': f"{lora/original:.2%}",
        'Reduction factor': f"{original//lora}x"
    })
print(pd.DataFrame(rows).to_string(index=False))
Output:
Rank LoRA params Original params % of original Reduction factor
1 1536 589824 0.26% 384x
2 3072 589824 0.52% 192x
4 6144 589824 1.04% 96x
8 12288 589824 2.08% 48x
16 24576 589824 4.17% 24x
32 49152 589824 8.33% 12x
64 98304 589824 16.67% 6x
Rank (r): how many dimensions to use in the low-rank approximation. Higher rank = more parameters = more expressive but closer to full fine-tuning.
- r=4 or r=8: most common starting point
- r=16 to r=32: for harder tasks that need more capacity
- r=64+: rarely needed; the adapter is no longer tiny (about 17% of a 768x768 matrix at r=64)
Alpha (α): scaling factor for the LoRA output. Controls how much influence LoRA has relative to the frozen model.
- Usually set to alpha = rank (scaling = 1.0)
- Or alpha = 2 * rank (scaling = 2.0, LoRA has more influence)
- Common: rank=8, alpha=16 (scaling=2)
Dropout: regularization inside LoRA. Typically 0.05 to 0.1.
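Because the LoRA output is multiplied by alpha / rank, changing the rank while leaving alpha fixed also changes how strongly the adapter contributes. A quick sketch of that interaction (`lora_scaling` is a hypothetical helper that mirrors `self.scaling` in the `LoRALayer` class above):

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Multiplier applied to the LoRA branch: alpha / rank."""
    return alpha / rank

for rank, alpha in [(8, 8), (8, 16), (16, 16), (64, 16)]:
    print(f"r={rank:<3} alpha={alpha:<3} scaling={lora_scaling(alpha, rank):.2f}")
```

Going from r=8 to r=64 with alpha held at 16 drops the multiplier from 2.0 to 0.25, which is one reason many recipes tie alpha to the rank (alpha = r or alpha = 2 * r) instead of fixing it.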
Which Layers to Apply LoRA To
In transformers, the attention mechanism has four weight matrices per layer: Q, K, V, and the output projection. The feed-forward block adds two more (three in gated architectures such as LLaMA: gate_proj, up_proj, and down_proj).
# Common LoRA target modules for different architectures
lora_targets = {
    'BERT / RoBERTa': {
        'targets': ['query', 'key', 'value', 'dense'],
        'note': "Q, K, V plus every layer named 'dense' (attention output and feed-forward)"
    },
    'GPT-2': {
        'targets': ['c_attn', 'c_proj'],
        'note': 'Fused QKV (c_attn); c_proj matches both the attention and MLP output projections'
    },
    'LLaMA / Mistral': {
        'targets': ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
        'note': 'All attention projections, sometimes gate_proj too'
    },
    'Minimal (fastest)': {
        'targets': ['q_proj', 'v_proj'],
        'note': 'Only query and value, fewer params but often enough'
    }
}
for arch, info in lora_targets.items():
    print(f"\n{arch}:")
    print(f"  Targets: {info['targets']}")
    print(f"  Note: {info['note']}")
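The LoRA paper found that, for a fixed parameter budget, adapting Q and V together (skipping K and the output projection) worked best in its GPT-3 experiments. Later work such as QLoRA reports that adapting all linear layers can be needed to fully match full fine-tuning, so treat Q and V as a cheap starting point rather than a guarantee.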
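The parameter cost of each choice is easy to estimate. The sketch below assumes BERT-base-like dimensions (12 layers, hidden size 768, square attention projections) and rank 8; `lora_params` is a hypothetical helper, and the count covers only the LoRA matrices, not any task head.

```python
def lora_params(shapes, rank):
    """Total LoRA parameters for a list of (d_in, d_out) target matrices."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes)

hidden, n_layers, rank = 768, 12, 8   # assumed BERT-base-like dimensions
square = (hidden, hidden)

q_v_only = lora_params([square] * 2 * n_layers, rank)   # Q and V in every layer
all_four = lora_params([square] * 4 * n_layers, rank)   # Q, K, V, and output projection

print(f"Q and V only: {q_v_only:,}")   # 294,912
print(f"Q, K, V, O:   {all_four:,}")   # 589,824
```

Targeting all four projections doubles the adapter size, which is still well under 1% of a 125M-parameter model.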
LoRA With HuggingFace PEFT
pip install peft transformers datasets evaluate
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch
model_name = 'roberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3
)
# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,         # sequence classification
    r=8,                                # rank
    lora_alpha=16,                      # alpha
    lora_dropout=0.1,                   # dropout
    target_modules=['query', 'value'],  # apply to Q and V only
    bias='none',                        # don't train biases
    inference_mode=False
)
# Wrap the model with LoRA
model = get_peft_model(base_model, lora_config)
# Check trainable parameters
model.print_trainable_parameters()
Output:
trainable params: 629,764 || all params: 125,277,444 || trainable%: 0.5025
0.5% of parameters. Everything else is frozen.
# Training with LoRA is identical to regular fine-tuning
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
from datasets import load_dataset
import evaluate
import numpy as np
# Load data
dataset = load_dataset('imdb')
small_train = dataset['train'].select(range(2000))
small_val = dataset['test'].select(range(500))
def tokenize(examples):
    return tokenizer(examples['text'], truncation=True, max_length=256)

# IMDB is a binary task; with num_labels=3 above, the third logit simply goes unused
train_ds = small_train.map(tokenize, batched=True, remove_columns=['text'])
val_ds = small_val.map(tokenize, batched=True, remove_columns=['text'])
train_ds = train_ds.rename_column('label', 'labels')
val_ds = val_ds.rename_column('label', 'labels')
accuracy = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return accuracy.compute(predictions=preds, references=eval_pred.label_ids)

training_args = TrainingArguments(
    output_dir='./lora_model',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=3e-4,  # LoRA can use higher LR than full fine-tuning
    weight_decay=0.01,
    evaluation_strategy='epoch',  # newer transformers releases rename this to eval_strategy
    save_strategy='epoch',
    load_best_model_at_end=True,
    report_to='none',
    fp16=torch.cuda.is_available()
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,  # newer transformers releases prefer processing_class=tokenizer
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics
)
trainer.train()
results = trainer.evaluate()
print(f"LoRA fine-tuning accuracy: {results['eval_accuracy']:.3f}")
Saving and Loading LoRA Weights
LoRA's other big advantage: the saved checkpoint is tiny. You only save the LoRA matrices, not the full model.
from peft import PeftModel
# Save only the LoRA weights
model.save_pretrained('./lora_weights') # saves adapter_config.json and adapter_model.bin
print("LoRA weights saved")
import os
for f in os.listdir('./lora_weights'):
size = os.path.getsize(f'./lora_weights/{f}') / 1e6
print(f" {f}: {size:.1f} MB")
Output:
LoRA weights saved
adapter_config.json: 0.001 MB
adapter_model.bin: 2.4 MB <- only 2.4 MB instead of 500+ MB!
# Load: start with base model, then load LoRA adapter
base_model_for_load = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3
)
loaded_lora_model = PeftModel.from_pretrained(base_model_for_load, './lora_weights')
loaded_lora_model.eval()
print("LoRA model loaded successfully")
Merging LoRA for Deployment
After training, you can merge the LoRA weights into the base model. Then you have one clean model with no overhead at inference time.
# Merge LoRA into base model
merged_model = model.merge_and_unload()
# Now merged_model is a regular model with no LoRA overhead
print(f"Type after merge: {type(merged_model)}")
# Save the merged model
merged_model.save_pretrained('./merged_model')
tokenizer.save_pretrained('./merged_model')
# Load it like any normal model
from transformers import AutoModelForSequenceClassification
final_model = AutoModelForSequenceClassification.from_pretrained('./merged_model')
print("Merged model loaded as regular model")
# Check: no LoRA parameters, just the full model
n_params = sum(p.numel() for p in final_model.parameters())
print(f"Parameters: {n_params:,}")
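Merging is lossless because the LoRA branch is linear: folding (alpha / r) × B × A into W gives the same outputs as running the two branches separately. The pure-Python sketch below checks this on toy matrices. It illustrates the algebra only and is not how PEFT implements `merge_and_unload`; the helper names are hypothetical, and dropout is ignored because it is inactive at inference time.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(M):
    return [list(col) for col in zip(*M)]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(1)
d_in, d_out, r, alpha = 5, 3, 2, 4
scale = alpha / r
W = rand_matrix(d_out, d_in)   # frozen pretrained weight
A = rand_matrix(r, d_in)       # trained LoRA factors (random stand-ins here)
B = rand_matrix(d_out, r)
x = rand_matrix(4, d_in)       # batch of 4 inputs

# Unmerged: two branches, x @ W^T + scale * x @ A^T @ B^T
base = matmul(x, transpose(W))
delta = matmul(matmul(x, transpose(A)), transpose(B))
unmerged = [[b + scale * d for b, d in zip(br, dr)] for br, dr in zip(base, delta)]

# Merged: fold the update into a single weight, W_merged = W + scale * B @ A
BA = matmul(B, A)
W_merged = [[w + scale * u for w, u in zip(wr, ur)] for wr, ur in zip(W, BA)]
merged = matmul(x, transpose(W_merged))

max_diff = max(abs(u - m) for ur, mr in zip(unmerged, merged) for u, m in zip(ur, mr))
print(max_diff < 1e-12)  # True: both paths agree up to floating-point rounding
```

After merging, inference costs exactly one matrix multiply per layer, the same as the original model.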
QLoRA: 4-bit Quantization + LoRA
QLoRA combines quantization (reducing weight precision to 4-bit) with LoRA. This lets you fine-tune 7B+ models on a single consumer GPU.
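To get a feel for what 4-bit storage means, here is a toy round trip through a simple absmax quantizer. This is a deliberately simplified stand-in: it uses evenly spaced integer levels, whereas the NF4 format used by QLoRA picks levels suited to normally distributed weights. The function names and sample weights are made up for illustration.

```python
def quantize_4bit(block):
    """Map floats to integers in [-7, 7] using one absmax scale for the whole block."""
    scale = max(abs(v) for v in block) / 7 or 1.0  # fall back to 1.0 for an all-zero block
    codes = [round(v / scale) for v in block]
    return codes, scale

def dequantize_4bit(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.02, 0.45]  # made-up sample block
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
max_err = max(abs(w - q) for w, q in zip(weights, restored))

print(codes)
print(f"max round-trip error: {max_err:.3f} (bound: {scale / 2:.3f})")
```

Each weight is stored as a small integer code plus one shared scale per block, which is where the roughly 0.5 bytes per parameter comes from. The frozen base weights absorb this rounding error, while the LoRA matrices train in higher precision on top.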
pip install bitsandbytes accelerate
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize to 4-bit
    bnb_4bit_quant_type='nf4',             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16
    bnb_4bit_use_double_quant=True         # double quantization (saves more memory)
)
# Load model in 4-bit (much less memory)
model_name = 'gpt2'  # swap with 'meta-llama/Llama-2-7b-hf' if you have access
qlora_base = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map='auto'  # automatically handles multi-GPU or CPU offload
)
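# Disable the KV cache during training (it only helps generation and
# conflicts with gradient checkpointing)
qlora_base.config.use_cache = False
# pretraining_tp is a LLaMA-specific setting; it has no effect on GPT-2
qlora_base.config.pretraining_tp = 1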
# Prepare for LoRA training with quantized model
from peft import prepare_model_for_kbit_training
qlora_base = prepare_model_for_kbit_training(qlora_base)
# Apply LoRA config
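qlora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    # GPT-2's fused QKV projection; this matches the 294,912 trainable params shown below.
    # Add 'c_proj' to also adapt the attention and MLP output projections (more parameters).
    target_modules=['c_attn'],
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)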
qlora_model = get_peft_model(qlora_base, qlora_config)
qlora_model.print_trainable_parameters()
Output:
trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364
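# Approximate memory for a 7B parameter model
# (weights at 4, 2, and 0.5 bytes per parameter; activations not included)
memory_estimates = {
    'Full fine-tuning (fp32)': '~28 GB weights, plus gradients and Adam states',
    'Full fine-tuning (fp16)': '~14 GB weights, plus gradients and Adam states',
    'LoRA (fp16 base)': '~14 GB frozen weights + small adapter overhead',
    'QLoRA (4-bit base + LoRA)': '~3.5 GB frozen weights + small adapter overhead',
}
print("Memory requirements for 7B parameter model:")
for method, memory in memory_estimates.items():
    print(f"  {method:<35}: {memory}")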
LoRA vs Full Fine-Tuning: Benchmark Comparison
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer, TrainingArguments, Trainer,
    DataCollatorWithPadding
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
import evaluate, numpy as np, time, torch

model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = load_dataset('imdb')
small_train = dataset['train'].select(range(1000))
small_val = dataset['test'].select(range(300))

def tokenize(examples):
    return tokenizer(examples['text'], truncation=True, max_length=128)

train_ds = small_train.map(tokenize, batched=True, remove_columns=['text'])
val_ds = small_val.map(tokenize, batched=True, remove_columns=['text'])
train_ds = train_ds.rename_column('label', 'labels')
val_ds = val_ds.rename_column('label', 'labels')
accuracy = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return accuracy.compute(predictions=preds, references=eval_pred.label_ids)

def run_experiment(use_lora, rank=8):
    # Fresh base model for every run so experiments don't share weights
    base = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    if use_lora:
        config = LoraConfig(
            task_type=TaskType.SEQ_CLS, r=rank,
            lora_alpha=rank*2, lora_dropout=0.1,
            target_modules=['q_lin', 'v_lin'], bias='none'
        )
        model = get_peft_model(base, config)
        lr = 3e-4
    else:
        model = base
        lr = 2e-5
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    args = TrainingArguments(
        output_dir=f'./exp_{"lora" if use_lora else "full"}',
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=lr,
        evaluation_strategy='epoch',  # newer transformers releases rename this to eval_strategy
        report_to='none',
        logging_steps=999
    )
    trainer = Trainer(
        model=model, args=args,
        train_dataset=train_ds, eval_dataset=val_ds,
        tokenizer=tokenizer,  # newer transformers releases prefer processing_class=tokenizer
        data_collator=DataCollatorWithPadding(tokenizer),
        compute_metrics=compute_metrics
    )
    start = time.time()
    trainer.train()
    elapsed = time.time() - start
    results = trainer.evaluate()
    return {
        'method': f'LoRA (r={rank})' if use_lora else 'Full fine-tuning',
        'trainable': f'{trainable:,} ({trainable/total:.1%})',
        'accuracy': f"{results['eval_accuracy']:.3f}",
        'time_s': f"{elapsed:.0f}s"
    }

print("Running comparison (this takes a few minutes)...")
results = [
    run_experiment(use_lora=False),
    run_experiment(use_lora=True, rank=4),
    run_experiment(use_lora=True, rank=8),
    run_experiment(use_lora=True, rank=16),
]
print(f"\n{'Method':<20} {'Trainable Params':<25} {'Accuracy':<12} {'Time'}")
print("-" * 70)
for r in results:
    print(f"{r['method']:<20} {r['trainable']:<25} {r['accuracy']:<12} {r['time_s']}")
Typical output (exact numbers vary with hardware, library versions, and random seed):
Method Trainable Params Accuracy Time
----------------------------------------------------------------------
Full fine-tuning 66,955,010 (100%) 0.934 148s
LoRA (r=4) 147,968 (0.22%) 0.921 102s
LoRA (r=8) 295,168 (0.44%) 0.928 108s
LoRA (r=16) 589,824 (0.88%) 0.931 115s
LoRA with r=8 gets 99.4% of full fine-tuning accuracy with 0.44% of the parameters and 73% of the training time. For larger models, the savings are even more dramatic.
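The three ratios quoted above come straight from the table. A short check, using the full fine-tuning and r=8 rows as reported there:

```python
# Values copied from the full fine-tuning and LoRA (r=8) rows of the table above
full = {"accuracy": 0.934, "time_s": 148, "trainable": 66_955_010}
lora_r8 = {"accuracy": 0.928, "time_s": 108, "trainable": 295_168}

acc_ratio = lora_r8["accuracy"] / full["accuracy"]
param_ratio = lora_r8["trainable"] / full["trainable"]
time_ratio = lora_r8["time_s"] / full["time_s"]

print(f"accuracy retained: {acc_ratio:.1%}")    # 99.4%
print(f"trainable params:  {param_ratio:.2%}")  # 0.44%
print(f"training time:     {time_ratio:.0%}")   # 73%
```

Note that the time saving is much smaller than the parameter saving: the frozen base model still runs a full forward and backward pass, so LoRA mainly cuts gradient and optimizer memory, not compute.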
When to Use LoRA vs Full Fine-Tuning
Use LoRA when:
- Model is large (> 1B parameters)
- GPU memory is limited
- You want to share adapters separately from the base model
- You want to try many different tasks with one base model
- Quick iteration is more important than peak accuracy
Use full fine-tuning when:
- Model is small (< 500M parameters)
- You have plenty of GPU memory
- Peak accuracy matters more than speed
- You only have one task to fine-tune for
- You'll ship a single final model and don't need swappable adapters
Quick Cheat Sheet
| Concept | What it means |
|---|---|
| Rank (r) | Dimensions of LoRA matrices. r=8 is a good default. |
| Alpha (α) | Scaling. Set to 2*r or same as r. |
| Target modules | Which weight matrices to apply LoRA to. Start with Q and V. |
| Scaling factor | alpha/rank. Controls LoRA strength. |
| Merge and unload | Bake LoRA into base weights. One clean model for deployment. |
| QLoRA | 4-bit quantization + LoRA. Fine-tune a 7B model on a single consumer GPU (about 3.5GB for the 4-bit weights, plus overhead). |
| Task | Code |
|---|---|
| Configure LoRA | LoraConfig(r=8, lora_alpha=16, target_modules=[...]) |
| Apply to model | get_peft_model(base_model, lora_config) |
| Check params | model.print_trainable_parameters() |
| Save adapters | model.save_pretrained('./lora_weights') |
| Load adapters | PeftModel.from_pretrained(base_model, './lora_weights') |
| Merge weights | model.merge_and_unload() |
| QLoRA setup | BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4') |
Practice Challenges
Level 1:
Apply LoRA to distilbert-base-uncased for a 3-class classification task. Use r=4 then r=16. Print the trainable parameter counts for both. Fine-tune each for 2 epochs. Compare accuracy vs parameter count.
Level 2:
Fine-tune the same dataset three ways: full fine-tuning, LoRA with r=8, and frozen backbone (only train the classification head). Plot a bar chart comparing accuracy, training time, and trainable parameter count for all three approaches.
Level 3:
Set up QLoRA with bitsandbytes on any GPT-style model. Verify it loads in 4-bit. Fine-tune on a small instruction dataset for 1 epoch. Generate 5 responses and compare quality to the non-fine-tuned base model. Report GPU memory usage before and after loading.
References
- LoRA paper: Low-Rank Adaptation of Large Language Models
- QLoRA paper: Efficient Finetuning of Quantized LLMs
- HuggingFace PEFT docs
- HuggingFace: LoRA training guide
- bitsandbytes: quantization library
Next up, Post 97: Embeddings and Vector Search: Semantic Search That Works. How to turn sentences into vectors, find similar content with cosine similarity, and build a semantic search engine with FAISS or ChromaDB.