96. LoRA: Fine-Tune a Billion-Parameter Model on a Laptop

GPT-2 (small) has 117M parameters. LLaMA-2 starts at 7B. GPT-3 has 175B.

Full fine-tuning means updating every single parameter. For GPT-2 that's manageable. For a 7B LLaMA-2 it takes 28GB of GPU memory just to store the gradients in fp32. For GPT-3 it's basically impossible without a cluster.

LoRA (Low-Rank Adaptation) solves this. Instead of updating the full weight matrices, it adds tiny trainable modules next to them. The original weights stay frozen. Only the tiny modules train. At the end you merge them back.

You go from needing a rack of A100s to needing a single consumer GPU. For small models, sometimes just a CPU.

What You'll Learn Here

Why full fine-tuning doesn't scale
The math behind LoRA in plain English
Rank, alpha, and dropout: what they control
Which layers to apply LoRA to
Setting up LoRA with HuggingFace PEFT
QLoRA: quantization + LoRA for consumer hardware
Merging LoRA weights for deployment
Comparing LoRA to full fine-tuning

The Problem With Full Fine-Tuning at Scale

# Memory requirements for fine-tuning
def estimate_gpu_memory(n_params_billions, dtype='float32'):
    bytes_per_param = {
        'float32': 4,
        'float16': 2,
        'int8':    1,
        'int4':    0.5
    }

    bpp       = bytes_per_param[dtype]
    model_gb  = n_params_billions * 1e9 * bpp / 1e9

    # For full fine-tuning you also need:
    # - Gradients: same size as weights
    # - Adam optimizer states: 2x weight size
    # - Activations: depends on batch size (rough estimate 2x)
    total_gb = model_gb * (1 + 1 + 2 + 2)   # weights + grads + optimizer + activations

    return model_gb, total_gb

print(f"{'Model':<15} {'Params':<10} {'Weights':<12} {'Full FT Memory'}")
print("-" * 50)
for name, params in [('GPT-2', 0.117), ('LLaMA-7B', 7), ('LLaMA-13B', 13), ('GPT-3', 175)]:
    w_gb, total = estimate_gpu_memory(params, 'float32')
    print(f"{name:<15} {params:<10} {w_gb:.1f} GB      {total:.0f} GB")

Output:

Model           Params     Weights      Full FT Memory
--------------------------------------------------
GPT-2           0.117      0.5 GB      3 GB
LLaMA-7B        7          28.0 GB     168 GB
LLaMA-13B       13         52.0 GB     312 GB
GPT-3           175        700.0 GB    4200 GB

By this estimate, LLaMA-7B full fine-tuning needs 168GB of GPU memory. A single A100 tops out at 80GB, so you need at least 3 of them, at a cost of $30,000+.

LoRA changes this dramatically.
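For a rough sense of scale, here is the same kind of back-of-the-envelope estimate for LoRA on a 7B model. It is a sketch under stated assumptions: LLaMA-7B-like dimensions (32 layers, hidden size 4096), rank-8 adapters on two square projections per layer (rank is explained in the next section), fp16 frozen weights, fp32 adapters trained with Adam, and activations ignored. The function name `estimate_lora_memory_gb` is made up for this example.

```python
def estimate_lora_memory_gb(n_layers=32, hidden=4096, rank=8,
                            matrices_per_layer=2, base_params=7e9):
    """Rough LoRA training memory in GB, ignoring activations."""
    # Each adapted (hidden x hidden) matrix gets A (rank x hidden) and B (hidden x rank)
    lora_params = n_layers * matrices_per_layer * rank * (hidden + hidden)

    # Frozen base weights stored in fp16: 2 bytes per parameter, no gradients
    frozen_gb = base_params * 2 / 1e9

    # Trainable adapters in fp32: weights (4) + gradients (4) + Adam states (8) bytes
    adapter_gb = lora_params * 16 / 1e9

    return lora_params, frozen_gb, adapter_gb

lora_params, frozen_gb, adapter_gb = estimate_lora_memory_gb()
print(f"Trainable LoRA params: {lora_params:,}")
print(f"Frozen weights:        {frozen_gb:.1f} GB")
print(f"Adapter training cost: {adapter_gb:.3f} GB")
```

Under these assumptions almost all of the memory is the frozen base model. The gradient and optimizer terms that dominate full fine-tuning shrink to well under 1 GB.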

How LoRA Works: The Math

A pretrained weight matrix W has shape (d_out, d_in). Full fine-tuning updates W directly:

W_new = W_pretrained + ΔW

ΔW has the same shape as W. That's the problem. It's huge.

LoRA's insight: the update ΔW doesn't need to have full rank. Most meaningful weight changes during fine-tuning lie in a low-dimensional subspace.

Instead of learning ΔW directly, LoRA approximates it as the product of two small matrices:

ΔW ≈ B × A

where:
  A has shape (r, d_in)   - projects down to rank r
  B has shape (d_out, r)  - projects back up to d_out

r << min(d_in, d_out)

During forward pass:

output = x @ W^T + x @ A^T @ B^T × (alpha/r)
       = (pretrained part) + (LoRA part)

W stays frozen. Only A and B train. Total parameters: r * (d_in + d_out) instead of d_in * d_out.
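The forward pass above can be traced by hand on toy matrices. The sketch below uses plain Python lists so it runs without any dependencies; the sizes (d_in=3, d_out=2, r=1) and the values in `B_trained` are made up for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(M):
    return [list(col) for col in zip(*M)]

def lora_forward(x, W, A, B, alpha, r):
    """output = x @ W^T + (x @ A^T @ B^T) * (alpha / r)"""
    base = matmul(x, transpose(W))                          # frozen path
    delta = matmul(matmul(x, transpose(A)), transpose(B))   # low-rank path
    scale = alpha / r
    return [[b + scale * d for b, d in zip(brow, drow)]
            for brow, drow in zip(base, delta)]

# Toy sizes: d_in=3, d_out=2, r=1
x = [[1.0, 2.0, 3.0]]                      # one input row, shape (1, d_in)
W = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]     # frozen weights, shape (d_out, d_in)
A = [[1.0, 1.0, 1.0]]                      # shape (r, d_in)
B_zero = [[0.0], [0.0]]                    # shape (d_out, r), all zeros
B_trained = [[0.5], [-1.0]]                # made-up values standing in for a trained B

print(lora_forward(x, W, A, B_zero, alpha=2, r=1))      # [[4.0, 2.0]], same as x @ W^T
print(lora_forward(x, W, A, B_trained, alpha=2, r=1))   # [[10.0, -10.0]]
```

With B at zero the LoRA branch contributes nothing, so the output equals the pretrained output. Once B is nonzero, the branch adds a scaled correction on top.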

import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    def __init__(self, original_layer, rank=8, alpha=16, dropout=0.1):
        super().__init__()

        self.original = original_layer
        self.rank     = rank
        self.alpha    = alpha
        self.scaling  = alpha / rank

        # Freeze the original layer
        for param in self.original.parameters():
            param.requires_grad = False

        # LoRA matrices A and B
        in_features  = original_layer.in_features
        out_features = original_layer.out_features

        self.lora_A = nn.Linear(in_features,  rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)

        self.dropout = nn.Dropout(dropout)

        # Initialize: A with Kaiming-uniform random values, B with zeros
        # B=0 means the LoRA delta is zero at init, so the wrapped layer
        # starts out identical to the original
        nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x):
        # Original output (frozen)
        original_out = self.original(x)

        # LoRA delta
        lora_out = self.lora_B(self.lora_A(self.dropout(x))) * self.scaling

        return original_out + lora_out

    def parameter_count(self):
        original_params = sum(p.numel() for p in self.original.parameters())
        lora_params     = sum(p.numel() for p in self.lora_A.parameters()) + \
                          sum(p.numel() for p in self.lora_B.parameters())
        return original_params, lora_params

# Test LoRA layer
original_linear = nn.Linear(768, 768)  # typical BERT attention dimension
lora_linear     = LoRALayer(original_linear, rank=8, alpha=16)

original_params, lora_params = lora_linear.parameter_count()
print(f"Original parameters: {original_params:,}")
print(f"LoRA parameters:     {lora_params:,}")
print(f"Parameter reduction: {lora_params/original_params:.1%} of original")

x   = torch.randn(2, 10, 768)
out = lora_linear(x)
print(f"\nInput shape:  {x.shape}")
print(f"Output shape: {out.shape}")

Output:

Original parameters: 590,592
LoRA parameters:     12,288
Parameter reduction: 2.1% of original

Input shape:  torch.Size([2, 10, 768])
Output shape: torch.Size([2, 10, 768])

12,288 parameters instead of 590,592. Same output shape. 2.1% of the original.

Rank, Alpha, and What They Control

import pandas as pd

# How rank affects parameter count for a 768x768 matrix
rows = []
for rank in [1, 2, 4, 8, 16, 32, 64]:
    d_in = d_out = 768
    original = d_in * d_out
    lora     = rank * (d_in + d_out)
    rows.append({
        'Rank': rank,
        'LoRA params': lora,
        'Original params': original,
        '% of original': f"{lora/original:.2%}",
        'Reduction factor': f"{original//lora}x"
    })

print(pd.DataFrame(rows).to_string(index=False))

Output:

 Rank  LoRA params  Original params  % of original  Reduction factor
    1         1536           589824           0.26%            384x
    2         3072           589824           0.52%            192x
    4         6144           589824           1.04%             96x
    8        12288           589824           2.08%             48x
   16        24576           589824           4.17%             24x
   32        49152           589824           8.33%             12x
   64        98304           589824          16.67%              6x

Rank (r): how many dimensions to use in the low-rank approximation. Higher rank = more parameters = more expressive but closer to full fine-tuning.

r=4 or r=8: most common starting point
r=16 to r=32: for harder tasks that need more capacity
r=64+: rarely needed, and the savings shrink (about 17% of the original matrix in the table above)

Alpha (α): scaling factor for the LoRA output. Controls how much influence LoRA has relative to the frozen model.

Usually set to alpha = rank (scaling = 1.0)
Or alpha = 2 * rank (scaling = 2.0, LoRA has more influence)
Common: rank=8, alpha=16 (scaling=2)

Dropout: regularization inside LoRA. Typically 0.05 to 0.1.
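Since the multiplier is just alpha divided by rank, a few combinations are easy to tabulate. This is a small illustrative sketch; the helper name `lora_scaling` is made up here.

```python
def lora_scaling(rank, alpha):
    """Multiplier applied to the LoRA branch: alpha / rank."""
    return alpha / rank

for rank, alpha in [(8, 8), (8, 16), (16, 16), (64, 16)]:
    print(f"rank={rank:<3} alpha={alpha:<3} scaling={lora_scaling(rank, alpha):.2f}")
```

Raising the rank while leaving alpha fixed shrinks the multiplier (0.25 at rank 64 with alpha 16), which is one reason alpha is often chosen as a multiple of the rank.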

Which Layers to Apply LoRA To

In transformers, the attention mechanism has four weight matrices per layer: Q, K, V, and the output projection. The feed-forward layers have two more.

# Common LoRA target modules for different architectures

lora_targets = {
    'BERT / RoBERTa': {
        'targets': ['query', 'key', 'value', 'dense'],
        'note': 'Q, K, V plus every layer named dense (attention output and feed-forward)'
    },
    'GPT-2': {
        'targets': ['c_attn', 'c_proj'],
        'note': 'Fused QKV projection plus every c_proj (attention output and MLP output)'
    },
    'LLaMA / Mistral': {
        'targets': ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
        'note': 'All attention projections, sometimes gate_proj too'
    },
    'Minimal (fastest)': {
        'targets': ['q_proj', 'v_proj'],
        'note': 'Only query and value, fewer params but often enough'
    }
}

for arch, info in lora_targets.items():
    print(f"\n{arch}:")
    print(f"  Targets: {info['targets']}")
    print(f"  Note:    {info['note']}")

In the original LoRA paper's experiments, adapting only Q and V (skipping K and the output projection) performed about as well as adapting all four attention matrices under the same parameter budget.
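Target names are matched against module names, so a short name can catch more layers than expected. The sketch below is a simplified stand-in for the suffix matching PEFT applies to a list of target names, not its actual implementation. The helper `matches_target` is hypothetical, and the module names follow the layout Hugging Face BERT models typically use.

```python
def matches_target(module_name, target_modules):
    """True if the module name equals a target or ends with '.<target>'."""
    return any(module_name == t or module_name.endswith("." + t)
               for t in target_modules)

# Linear layers in one BERT encoder layer
bert_layer_modules = [
    "encoder.layer.0.attention.self.query",
    "encoder.layer.0.attention.self.key",
    "encoder.layer.0.attention.self.value",
    "encoder.layer.0.attention.output.dense",
    "encoder.layer.0.intermediate.dense",
    "encoder.layer.0.output.dense",
]

for targets in (["query", "value"], ["query", "key", "value", "dense"]):
    hits = [m for m in bert_layer_modules if matches_target(m, targets)]
    print(f"{targets} -> {len(hits)} modules per layer")
```

In this layout, `dense` picks up the two feed-forward layers as well as the attention output projection, so it is worth listing the matched modules before training.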

LoRA With HuggingFace PEFT

pip install peft

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

model_name = 'roberta-base'
tokenizer  = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,       # sequence classification
    r=8,                               # rank
    lora_alpha=16,                     # alpha
    lora_dropout=0.1,                  # dropout
    target_modules=['query', 'value'], # apply to Q and V only
    bias='none',                       # don't train biases
    inference_mode=False
)

# Wrap the model with LoRA
model = get_peft_model(base_model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()

Output:

trainable params: 629,764 || all params: 125,277,444 || trainable%: 0.5025

0.5% of parameters. Everything else is frozen.

# Training with LoRA is identical to regular fine-tuning
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
from datasets import load_dataset
import evaluate
import numpy as np

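# Load data. IMDB is stored sorted by label, so shuffle before taking a subset;
# otherwise the first rows all share one class.
dataset   = load_dataset('imdb')
small_train = dataset['train'].shuffle(seed=42).select(range(2000))
small_val   = dataset['test'].shuffle(seed=42).select(range(500))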

def tokenize(examples):
    return tokenizer(examples['text'], truncation=True, max_length=256)

train_ds = small_train.map(tokenize, batched=True, remove_columns=['text'])
val_ds   = small_val.map(tokenize,   batched=True, remove_columns=['text'])
train_ds = train_ds.rename_column('label', 'labels')
val_ds   = val_ds.rename_column('label', 'labels')

accuracy = evaluate.load('accuracy')
def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return accuracy.compute(predictions=preds, references=eval_pred.label_ids)

training_args = TrainingArguments(
    output_dir='./lora_model',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=3e-4,          # LoRA can use higher LR than full fine-tuning
    weight_decay=0.01,
    evaluation_strategy='epoch',  # named eval_strategy in newer transformers releases
    save_strategy='epoch',
    load_best_model_at_end=True,
    report_to='none',
    fp16=torch.cuda.is_available()
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,          # newer transformers releases call this processing_class
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics
)

trainer.train()
results = trainer.evaluate()
print(f"LoRA fine-tuning accuracy: {results['eval_accuracy']:.3f}")

Saving and Loading LoRA Weights

LoRA's other big advantage: the saved checkpoint is tiny. You only save the LoRA matrices, not the full model.

from peft import PeftModel

# Save only the LoRA weights
model.save_pretrained('./lora_weights')   # saves adapter_config.json plus the adapter weights (adapter_model.bin, or adapter_model.safetensors in newer PEFT versions)
print("LoRA weights saved")

import os
for f in os.listdir('./lora_weights'):
    size = os.path.getsize(f'./lora_weights/{f}') / 1e6
    print(f"  {f}: {size:.1f} MB")

Output:

LoRA weights saved
  adapter_config.json: 0.0 MB
  adapter_model.bin: 2.4 MB     <- only 2.4 MB instead of 500+ MB!
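The file sizes follow from the parameter counts. As a rough check, the sketch below multiplies the counts printed earlier by 4 bytes per fp32 parameter. It ignores file metadata, and `checkpoint_size_mb` is a made-up helper for this example.

```python
def checkpoint_size_mb(n_params, bytes_per_param=4):
    """Approximate on-disk size of a tensor checkpoint, ignoring file metadata."""
    return n_params * bytes_per_param / 1e6

adapter_mb = checkpoint_size_mb(629_764)        # trainable count printed earlier
full_mb    = checkpoint_size_mb(125_277_444)    # total count printed earlier

print(f"Adapter only: {adapter_mb:.1f} MB")
print(f"Full model:   {full_mb:.1f} MB")
```

The adapter estimate lands close to the 2.4 MB shown above, and the full model comes out at roughly 500 MB.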

# Load: start with base model, then load LoRA adapter
base_model_for_load = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3
)
loaded_lora_model = PeftModel.from_pretrained(base_model_for_load, './lora_weights')
loaded_lora_model.eval()
print("LoRA model loaded successfully")

Merging LoRA for Deployment

After training, you can merge the LoRA weights into the base model. Then you have one clean model with no overhead at inference time.
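Merging works because the LoRA branch is linear: adding (alpha/r) × B × A to W once gives the same outputs as running both branches on every call. The dependency-free sketch below checks this on toy matrices. All values are made up, and `merge_lora` is a hypothetical helper, not the PEFT function.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(M):
    return [list(col) for col in zip(*M)]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    delta = matmul(B, A)                      # shape (d_out, d_in), same as W
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wrow, drow)]
            for wrow, drow in zip(W, delta)]

alpha, r = 2, 1
W = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]   # frozen weights (d_out=2, d_in=3)
A = [[1.0, 1.0, 1.0]]                     # (r=1, d_in=3)
B = [[0.5], [-1.0]]                       # (d_out=2, r=1)
x = [[1.0, 2.0, 3.0]]

W_merged = merge_lora(W, A, B, alpha, r)

# Two-branch forward pass: frozen path plus scaled LoRA path
base  = matmul(x, transpose(W))
delta = matmul(matmul(x, transpose(A)), transpose(B))
two_branch = [[b + (alpha / r) * d for b, d in zip(base[0], delta[0])]]

# Single matmul with the merged weights
merged_out = matmul(x, transpose(W_merged))

print(W_merged)                 # [[2.0, 1.0, 2.0], [-2.0, -1.0, -2.0]]
print(two_branch, merged_out)   # both [[10.0, -10.0]]
```

After the merge only one matrix multiply remains per layer, which is why the merged model has no extra inference cost.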

# Merge LoRA into base model
merged_model = model.merge_and_unload()

# Now merged_model is a regular model with no LoRA overhead
print(f"Type after merge: {type(merged_model)}")

# Save the merged model
merged_model.save_pretrained('./merged_model')
tokenizer.save_pretrained('./merged_model')

# Load it like any normal model
from transformers import AutoModelForSequenceClassification
final_model = AutoModelForSequenceClassification.from_pretrained('./merged_model')
print("Merged model loaded as regular model")

# Check: no LoRA parameters, just the full model
n_params = sum(p.numel() for p in final_model.parameters())
print(f"Parameters: {n_params:,}")

QLoRA: 4-bit Quantization + LoRA

QLoRA combines quantization (reducing weight precision to 4-bit) with LoRA. This lets you fine-tune 7B+ models on a single consumer GPU.
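To see what storing weights in 4 bits means, here is a deliberately simplified sketch. Real QLoRA uses the NF4 data type from bitsandbytes, which has a non-uniform codebook. The code below swaps in plain uniform absmax rounding with one scale per block, so it only illustrates the idea of trading precision for memory. The function names and weight values are made up.

```python
def quantize_absmax_4bit(block):
    """Map floats to integers in [-7, 7] using one scale per block."""
    scale = max(abs(v) for v in block) / 7 or 1.0   # fall back to 1.0 for an all-zero block
    codes = [round(v / scale) for v in block]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.12, -0.70, 0.33, 0.04, -0.21, 0.48, -0.02, 0.61]
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:    ", codes)
print("restored: ", [round(r, 2) for r in restored])
print(f"max error: {max_error:.3f} (bounded by about scale/2 = {scale / 2:.3f})")
```

Each weight now needs only a 4-bit code plus a shared scale, at the cost of a small rounding error. In QLoRA the base weights stay frozen in this compressed form while the LoRA matrices train in higher precision.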

pip install bitsandbytes

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # quantize to 4-bit
    bnb_4bit_quant_type='nf4',            # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16, # compute in fp16
    bnb_4bit_use_double_quant=True        # double quantization (saves more memory)
)

# Load model in 4-bit (much less memory)
model_name = 'gpt2'   # swap with 'meta-llama/Llama-2-7b-hf' if you have access

qlora_base = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map='auto'          # automatically handles multi-GPU or CPU offload
)

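# Disable the generation KV cache during training (it conflicts with gradient checkpointing)
qlora_base.config.use_cache           = False
# Only meaningful for LLaMA-style configs; harmless on GPT-2
qlora_base.config.pretraining_tp      = 1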

# Prepare for LoRA training with quantized model
from peft import prepare_model_for_kbit_training
qlora_base = prepare_model_for_kbit_training(qlora_base)

# Apply LoRA config
qlora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=['c_attn'],            # GPT-2's fused QKV projection; matches the trainable count printed below
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)

qlora_model = get_peft_model(qlora_base, qlora_config)
qlora_model.print_trainable_parameters()

Output:

trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364

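# Memory savings with QLoRA
# The weight figures are bytes per parameter x 7B. Gradients, optimizer states,
# and activations come on top: large for full fine-tuning, small for LoRA
# because only the adapters are trained.
memory_estimates = {
    'Full fine-tuning (fp32)':     '~28 GB weights, ~168 GB total by the estimate above',
    'Full fine-tuning (fp16)':     '~14 GB weights, plus gradients and optimizer states',
    'LoRA (fp16 base)':            '~14 GB weights, plus small adapter overhead',
    'QLoRA (4-bit base + LoRA)':   '~3.5 GB weights, plus small adapter overhead',
}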

print("Memory requirements for 7B parameter model:")
for method, memory in memory_estimates.items():
    print(f"  {method:<35}: {memory}")

LoRA vs Full Fine-Tuning: Benchmark Comparison

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer, TrainingArguments, Trainer,
    DataCollatorWithPadding
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
import evaluate, numpy as np, time, torch

model_name = 'distilbert-base-uncased'
tokenizer  = AutoTokenizer.from_pretrained(model_name)

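# IMDB is stored sorted by label, so shuffle before taking a subset
dataset    = load_dataset('imdb')
small_train = dataset['train'].shuffle(seed=42).select(range(1000))
small_val   = dataset['test'].shuffle(seed=42).select(range(300))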

def tokenize(examples):
    return tokenizer(examples['text'], truncation=True, max_length=128)

train_ds = small_train.map(tokenize, batched=True, remove_columns=['text'])
val_ds   = small_val.map(tokenize,   batched=True, remove_columns=['text'])
train_ds = train_ds.rename_column('label', 'labels')
val_ds   = val_ds.rename_column('label', 'labels')

accuracy = evaluate.load('accuracy')
def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return accuracy.compute(predictions=preds, references=eval_pred.label_ids)

def run_experiment(use_lora, rank=8):
    base = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    if use_lora:
        config = LoraConfig(
            task_type=TaskType.SEQ_CLS, r=rank,
            lora_alpha=rank*2, lora_dropout=0.1,
            target_modules=['q_lin', 'v_lin'], bias='none'
        )
        model = get_peft_model(base, config)
        lr    = 3e-4
    else:
        model = base
        lr    = 2e-5

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total     = sum(p.numel() for p in model.parameters())

    args = TrainingArguments(
        output_dir=f'./exp_{"lora" if use_lora else "full"}',
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=lr,
        evaluation_strategy='epoch',  # named eval_strategy in newer transformers releases
        report_to='none',
        logging_steps=999
    )

    trainer = Trainer(
        model=model, args=args,
        train_dataset=train_ds, eval_dataset=val_ds,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer),
        compute_metrics=compute_metrics
    )

    start   = time.time()
    trainer.train()
    elapsed = time.time() - start
    results = trainer.evaluate()

    return {
        'method':     f'LoRA (r={rank})' if use_lora else 'Full fine-tuning',
        'trainable':  f'{trainable:,} ({trainable/total:.1%})',
        'accuracy':   f"{results['eval_accuracy']:.3f}",
        'time_s':     f"{elapsed:.0f}s"
    }

print("Running comparison (this takes a few minutes)...")
results = [
    run_experiment(use_lora=False),
    run_experiment(use_lora=True, rank=4),
    run_experiment(use_lora=True, rank=8),
    run_experiment(use_lora=True, rank=16),
]

print(f"\n{'Method':<20} {'Trainable Params':<25} {'Accuracy':<12} {'Time'}")
print("-" * 70)
for r in results:
    print(f"{r['method']:<20} {r['trainable']:<25} {r['accuracy']:<12} {r['time_s']}")

Typical output (exact counts, accuracy, and timings will vary with library versions and hardware):

Method               Trainable Params          Accuracy     Time
----------------------------------------------------------------------
Full fine-tuning     66,955,010 (100%)         0.934        148s
LoRA (r=4)           147,968 (0.22%)           0.921        102s
LoRA (r=8)           295,168 (0.44%)           0.928        108s
LoRA (r=16)          589,824 (0.88%)           0.931        115s

LoRA with r=8 gets 99.4% of full fine-tuning accuracy with 0.44% of the parameters and 73% of the training time. For larger models, the savings are even more dramatic.
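The two ratios quoted above come straight from the table. A quick check, using only the figures shown there:

```python
# Figures copied from the table above
full    = {"accuracy": 0.934, "time_s": 148}
lora_r8 = {"accuracy": 0.928, "time_s": 108}

accuracy_ratio = lora_r8["accuracy"] / full["accuracy"]
time_ratio     = lora_r8["time_s"] / full["time_s"]

print(f"Accuracy retained: {accuracy_ratio:.1%}")                     # 99.4%
print(f"Training time:     {time_ratio:.0%} of full fine-tuning")     # 73%
```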

When to Use LoRA vs Full Fine-Tuning

Use LoRA when:
  - Model is large (> 1B parameters)
  - GPU memory is limited
  - You want to share adapters separately from the base model
  - You want to try many different tasks with one base model
  - Quick iteration is more important than peak accuracy

Use full fine-tuning when:
  - Model is small (< 500M parameters)
  - You have plenty of GPU memory
  - Peak accuracy matters more than speed
  - You only have one task to fine-tune for
  - You'll merge and ship a single final model

Quick Cheat Sheet

Concept	What it means
Rank (r)	Dimensions of LoRA matrices. r=8 is a good default.
Alpha (α)	Scaling. Set to 2*r or same as r.
Target modules	Which weight matrices to apply LoRA to. Start with Q and V.
Scaling factor	alpha/rank. Controls LoRA strength.
Merge and unload	Bake LoRA into base weights. One clean model for deployment.
QLoRA	4-bit quantization + LoRA. Fine-tune 7B on a single consumer GPU.

Task	Code
Configure LoRA	`LoraConfig(r=8, lora_alpha=16, target_modules=[...])`
Apply to model	`get_peft_model(base_model, lora_config)`
Check params	`model.print_trainable_parameters()`
Save adapters	`model.save_pretrained('./lora_weights')`
Load adapters	`PeftModel.from_pretrained(base_model, './lora_weights')`
Merge weights	`model.merge_and_unload()`
QLoRA setup	`BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4')`

Practice Challenges

Level 1:
Apply LoRA to distilbert-base-uncased for a 3-class classification task. Use r=4 then r=16. Print the trainable parameter counts for both. Fine-tune each for 2 epochs. Compare accuracy vs parameter count.

Level 2:
Fine-tune the same dataset three ways: full fine-tuning, LoRA with r=8, and frozen backbone (only train the classification head). Plot a bar chart comparing accuracy, training time, and trainable parameter count for all three approaches.

Level 3:
Set up QLoRA with bitsandbytes on any GPT-style model. Verify it loads in 4-bit. Fine-tune on a small instruction dataset for 1 epoch. Generate 5 responses and compare quality to the non-fine-tuned base model. Report GPU memory usage before and after loading.

Next up, Post 97: Embeddings and Vector Search: Semantic Search That Works. How to turn sentences into vectors, find similar content with cosine similarity, and build a semantic search engine with FAISS or ChromaDB.
