96. LoRA: Fine-Tune a Billion-Parameter Model on a Laptop
Akhilesh · 2026-05-25 · via DEV Community

GPT-2 has 117M parameters. LLaMA-2 has 7B. GPT-3 has 175B.

Full fine-tuning means updating every single parameter. For GPT-2 that's manageable. For LLaMA-2 7B in fp32, the gradients alone take 28GB of GPU memory, on top of another 28GB for the weights. For GPT-3 it's basically impossible without a cluster.

LoRA (Low-Rank Adaptation) solves this. Instead of updating the full weight matrices, it adds tiny trainable modules next to them. The original weights stay frozen. Only the tiny modules train. At the end you merge them back.

You go from needing a multi-GPU A100 server to needing a single consumer GPU. For small models, sometimes just a CPU.


What You'll Learn Here

  • Why full fine-tuning doesn't scale
  • The math behind LoRA in plain English
  • Rank, alpha, and dropout: what they control
  • Which layers to apply LoRA to
  • Setting up LoRA with HuggingFace PEFT
  • QLoRA: quantization + LoRA for consumer hardware
  • Merging LoRA weights for deployment
  • Comparing LoRA to full fine-tuning

The Problem With Full Fine-Tuning at Scale

# Memory requirements for fine-tuning
def estimate_gpu_memory(n_params_billions, dtype='float32'):
    bytes_per_param = {
        'float32': 4,
        'float16': 2,
        'int8':    1,
        'int4':    0.5
    }

    bpp       = bytes_per_param[dtype]
    model_gb  = n_params_billions * 1e9 * bpp / 1e9

    # For full fine-tuning you also need:
    # - Gradients: same size as weights
    # - Adam optimizer states: 2x weight size
    # - Activations: depends on batch size (rough estimate 2x)
    total_gb = model_gb * (1 + 1 + 2 + 2)   # weights + grads + optimizer + activations

    return model_gb, total_gb

print(f"{'Model':<15} {'Params':<10} {'Weights':<12} {'Full FT Memory'}")
print("-" * 50)
for name, params in [('GPT-2', 0.117), ('LLaMA-7B', 7), ('LLaMA-13B', 13), ('GPT-3', 175)]:
    w_gb, total = estimate_gpu_memory(params, 'float32')
    print(f"{name:<15} {params:<10} {w_gb:.1f} GB      {total:.0f} GB")


Output:

Model           Params     Weights      Full FT Memory
--------------------------------------------------
GPT-2           0.117      0.5 GB      3 GB
LLaMA-7B        7          28.0 GB     168 GB
LLaMA-13B       13         52.0 GB     312 GB
GPT-3           175        700.0 GB    4200 GB


By this estimate, full fine-tuning of LLaMA-7B in fp32 needs about 168GB of GPU memory. A single A100 has 80GB, so you need at least three of them, at $30,000+.

LoRA changes this dramatically.
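
To see roughly where the savings come from, here is a back-of-the-envelope sketch. It reuses the multipliers from the estimate above (gradients 1x, Adam states 2x) but ignores activations. It assumes a hypothetical adapter with 0.1% as many parameters as the base model. `training_state_gb` is an illustrative helper, not a library function.

```python
def training_state_gb(n_weights, n_trainable, bytes_per_param=4):
    """Weights + gradients + Adam moments in GB, ignoring activations."""
    weights = n_weights * bytes_per_param            # every weight is stored
    grads = n_trainable * bytes_per_param            # gradients only for trainable params
    adam = 2 * n_trainable * bytes_per_param         # two Adam moments per trainable param
    return (weights + grads + adam) / 1e9

n_weights = 7_000_000_000                            # a 7B-parameter model in fp32

# Full fine-tuning: every weight is trainable
full_gb = training_state_gb(n_weights, n_trainable=n_weights)

# LoRA: assume (hypothetically) adapters with 0.1% of the base parameter count
lora_trainable = n_weights // 1000
lora_gb = training_state_gb(n_weights, n_trainable=lora_trainable)

print(f"Full fine-tuning: {full_gb:.1f} GB")
print(f"LoRA:             {lora_gb:.1f} GB")
```

With LoRA, nearly all of the remaining memory is the frozen weights themselves, which is why quantizing them (QLoRA, covered later) helps further.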


How LoRA Works: The Math

A pretrained weight matrix W has shape (d_out, d_in). Full fine-tuning updates W directly:

W_new = W_pretrained + ΔW


ΔW has the same shape as W. That's the problem. It's huge.

LoRA's insight: the update ΔW doesn't need to have full rank. Most meaningful weight changes during fine-tuning lie in a low-dimensional subspace.

Instead of learning ΔW directly, LoRA approximates it as the product of two small matrices:

ΔW ≈ B × A

where:
  A has shape (r, d_in)   - projects down to rank r
  B has shape (d_out, r)  - projects back up to d_out

r << min(d_in, d_out)

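
A quick numerical check of the factorization, as a minimal sketch with NumPy. The dimensions are arbitrary illustrative values. The product B @ A has the full (d_out, d_in) shape, but its rank is at most r and it is described by far fewer numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 4          # illustrative sizes, r << min(d_in, d_out)

B = rng.normal(size=(d_out, r))     # projects back up to d_out
A = rng.normal(size=(r, d_in))      # projects down to rank r

delta_W = B @ A                     # same shape as the full weight matrix
rank = np.linalg.matrix_rank(delta_W)

full_params = d_out * d_in          # parameters in a full ΔW
lora_params = r * (d_in + d_out)    # parameters in A and B together

print(f"ΔW shape: {delta_W.shape}, rank: {rank}")
print(f"Full ΔW parameters: {full_params}, LoRA parameters: {lora_params}")
```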

During forward pass:

output = x @ W^T + (x @ A^T @ B^T) × (alpha/r)
       = (pretrained part) + (LoRA part)


W stays frozen. Only A and B train. Total parameters: r * (d_in + d_out) instead of d_in * d_out.

import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    def __init__(self, original_layer, rank=8, alpha=16, dropout=0.1):
        super().__init__()

        self.original = original_layer
        self.rank     = rank
        self.alpha    = alpha
        self.scaling  = alpha / rank

        # Freeze the original layer
        for param in self.original.parameters():
            param.requires_grad = False

        # LoRA matrices A and B
        in_features  = original_layer.in_features
        out_features = original_layer.out_features

        self.lora_A = nn.Linear(in_features,  rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)

        self.dropout = nn.Dropout(dropout)

        # Initialize: A with Kaiming-uniform noise (the LoRA paper uses a Gaussian), B with zeros
        # B=0 means the LoRA delta is zero at init, so the wrapped layer starts out identical to the original
        nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x):
        # Original output (frozen)
        original_out = self.original(x)

        # LoRA delta
        lora_out = self.lora_B(self.lora_A(self.dropout(x))) * self.scaling

        return original_out + lora_out

    def parameter_count(self):
        original_params = sum(p.numel() for p in self.original.parameters())
        lora_params     = sum(p.numel() for p in self.lora_A.parameters()) + \
                          sum(p.numel() for p in self.lora_B.parameters())
        return original_params, lora_params

# Test LoRA layer
original_linear = nn.Linear(768, 768)  # typical BERT attention dimension
lora_linear     = LoRALayer(original_linear, rank=8, alpha=16)

original_params, lora_params = lora_linear.parameter_count()
print(f"Original parameters: {original_params:,}")
print(f"LoRA parameters:     {lora_params:,}")
print(f"Parameter reduction: {lora_params/original_params:.1%} of original")

x   = torch.randn(2, 10, 768)
out = lora_linear(x)
print(f"\nInput shape:  {x.shape}")
print(f"Output shape: {out.shape}")


Output:

Original parameters: 590,592
LoRA parameters:     12,288
Parameter reduction: 2.1% of original

Input shape:  torch.Size([2, 10, 768])
Output shape: torch.Size([2, 10, 768])


12,288 parameters instead of 590,592. Same output shape. 2.1% of the original.
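
The zero initialization of B is what makes LoRA safe to attach: before any training, the wrapped layer produces exactly the frozen layer's output. Here is a minimal NumPy sketch of that property. The sizes are illustrative, and the random perturbation of B merely stands in for a training step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 32, 24, 4, 8
scaling = alpha / r

W = rng.normal(size=(d_out, d_in))   # frozen pretrained weights
A = rng.normal(size=(r, d_in))       # random init
B = np.zeros((d_out, r))             # zero init
x = rng.normal(size=(5, d_in))

frozen_out = x @ W.T
lora_out = frozen_out + (x @ A.T @ B.T) * scaling

# At init the LoRA term is exactly zero, so nothing changes
same_at_init = np.array_equal(frozen_out, lora_out)

# Stand-in for training: once B moves away from zero, the output changes
B_trained = B + 0.01 * rng.normal(size=B.shape)
trained_out = frozen_out + (x @ A.T @ B_trained.T) * scaling
changed_after_update = not np.allclose(frozen_out, trained_out)

print(f"Identical at init: {same_at_init}")
print(f"Changed after update: {changed_after_update}")
```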


Rank, Alpha, and What They Control

import pandas as pd

# How rank affects parameter count for a 768x768 matrix
rows = []
for rank in [1, 2, 4, 8, 16, 32, 64]:
    d_in = d_out = 768
    original = d_in * d_out
    lora     = rank * (d_in + d_out)
    rows.append({
        'Rank': rank,
        'LoRA params': lora,
        'Original params': original,
        '% of original': f"{lora/original:.2%}",
        'Reduction factor': f"{original//lora}x"
    })

print(pd.DataFrame(rows).to_string(index=False))


Output:

 Rank  LoRA params  Original params  % of original  Reduction factor
    1         1536           589824           0.26%            384x
    2         3072           589824           0.52%            192x
    4         6144           589824           1.04%             96x
    8        12288           589824           2.08%             48x
   16        24576           589824           4.17%             24x
   32        49152           589824           8.33%             12x
   64        98304           589824          16.67%              6x


Rank (r): how many dimensions to use in the low-rank approximation. Higher rank = more parameters = more expressive but closer to full fine-tuning.

  • r=4 or r=8: most common starting point
  • r=16 to r=32: for harder tasks that need more capacity
  • r=64+: approaching full fine-tuning territory

Alpha (α): scaling factor for the LoRA output. Controls how much influence LoRA has relative to the frozen model.

  • Usually set to alpha = rank (scaling = 1.0)
  • Or alpha = 2 * rank (scaling = 2.0, LoRA has more influence)
  • Common: rank=8, alpha=16 (scaling=2)

Dropout: regularization inside LoRA. Typically 0.05 to 0.1.
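
Because the LoRA output is multiplied by alpha / rank, the two settings interact. A small sketch of that arithmetic follows. `lora_scaling` is an illustrative helper and the (rank, alpha) pairs are example values.

```python
def lora_scaling(rank: int, alpha: float) -> float:
    """Multiplier applied to the LoRA branch before it is added to the frozen output."""
    return alpha / rank

# Example (rank, alpha) pairs
settings = [(8, 8), (8, 16), (16, 16), (64, 16)]

for rank, alpha in settings:
    print(f"rank={rank:<3} alpha={alpha:<3} scaling={lora_scaling(rank, alpha):.2f}")
```

Note the last two rows: if alpha stays fixed while rank grows, the scaling shrinks. That is one reason alpha is often chosen as a multiple of rank.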


Which Layers to Apply LoRA To

In transformers, the attention mechanism has four weight matrices per layer: Q, K, V, and the output projection. The feed-forward block adds two more (three in gated designs such as LLaMA, which has gate_proj, up_proj, and down_proj).

# Common LoRA target modules for different architectures

lora_targets = {
    'BERT / RoBERTa': {
        'targets': ['query', 'key', 'value', 'dense'],
        'note': "Attention projections; 'dense' also matches the feed-forward dense layers"
    },
    'GPT-2': {
        'targets': ['c_attn', 'c_proj'],
        'note': "Fused QKV projection; 'c_proj' matches both the attention and MLP output projections"
    },
    'LLaMA / Mistral': {
        'targets': ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
        'note': 'All attention projections, sometimes gate_proj too'
    },
    'Minimal (fastest)': {
        'targets': ['q_proj', 'v_proj'],
        'note': 'Only query and value, fewer params but often enough'
    }
}

for arch, info in lora_targets.items():
    print(f"\n{arch}:")
    print(f"  Targets: {info['targets']}")
    print(f"  Note:    {info['note']}")


The original LoRA paper (Hu et al., 2021) found that applying LoRA to Q and V only (skipping K) often works nearly as well as adapting all four attention matrices while using fewer parameters.
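
To put numbers on that trade-off, this sketch counts adapter parameters for the two choices. It assumes a 12-layer model with hidden size 768 (BERT-base-like) and square attention projections. The helper names and the short keys 'q', 'k', 'v', 'o' are illustrative, not real module names.

```python
def lora_params(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)

def count_adapter_params(n_layers, target_shapes, rank):
    """Total LoRA parameters when every listed matrix in every layer gets an adapter."""
    per_layer = sum(lora_params(d_in, d_out, rank)
                    for d_in, d_out in target_shapes.values())
    return n_layers * per_layer

hidden, n_layers, rank = 768, 12, 8   # assumed BERT-base-like dimensions

attention = {name: (hidden, hidden) for name in ('q', 'k', 'v', 'o')}
qv_only = {name: attention[name] for name in ('q', 'v')}

qv_count = count_adapter_params(n_layers, qv_only, rank)
all_count = count_adapter_params(n_layers, attention, rank)

print(f"Q and V only:       {qv_count:,}")
print(f"All four matrices:  {all_count:,}")
```

Targeting only Q and V halves the adapter size here, since all four projections have the same shape.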


LoRA With HuggingFace PEFT

pip install peft


from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

model_name = 'roberta-base'
tokenizer  = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,       # sequence classification
    r=8,                               # rank
    lora_alpha=16,                     # alpha
    lora_dropout=0.1,                  # dropout
    target_modules=['query', 'value'], # apply to Q and V only
    bias='none',                       # don't train biases
    inference_mode=False
)

# Wrap the model with LoRA
model = get_peft_model(base_model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()


Output:

trainable params: 629,764 || all params: 125,277,444 || trainable%: 0.5025


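Well under 1% of the parameters are trainable. Everything else is frozen. The exact count depends on your PEFT version, and it includes the new classification head, which PEFT keeps trainable for SEQ_CLS tasks.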

# Training with LoRA is identical to regular fine-tuning
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
from datasets import load_dataset
import evaluate
import numpy as np

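# Load data. IMDB is sorted by label, so shuffle before taking a subset;
# otherwise the first 2,000 examples all belong to one class.
# Note: IMDB is binary (labels 0 and 1), so num_labels=2 is the natural choice for the head.
dataset   = load_dataset('imdb')
small_train = dataset['train'].shuffle(seed=42).select(range(2000))
small_val   = dataset['test'].shuffle(seed=42).select(range(500))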

def tokenize(examples):
    return tokenizer(examples['text'], truncation=True, max_length=256)

train_ds = small_train.map(tokenize, batched=True, remove_columns=['text'])
val_ds   = small_val.map(tokenize,   batched=True, remove_columns=['text'])
train_ds = train_ds.rename_column('label', 'labels')
val_ds   = val_ds.rename_column('label', 'labels')

accuracy = evaluate.load('accuracy')
def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return accuracy.compute(predictions=preds, references=eval_pred.label_ids)

training_args = TrainingArguments(
    output_dir='./lora_model',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=3e-4,          # LoRA can use higher LR than full fine-tuning
    weight_decay=0.01,
    evaluation_strategy='epoch',   # renamed to eval_strategy in recent transformers releases
    save_strategy='epoch',
    load_best_model_at_end=True,
    report_to='none',
    fp16=torch.cuda.is_available()
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics
)

trainer.train()
results = trainer.evaluate()
print(f"LoRA fine-tuning accuracy: {results['eval_accuracy']:.3f}")



Saving and Loading LoRA Weights

LoRA's other big advantage: the saved checkpoint is tiny. You only save the LoRA matrices, not the full model.

from peft import PeftModel

# Save only the LoRA weights
model.save_pretrained('./lora_weights')   # saves adapter_config.json and the adapter weights (adapter_model.safetensors in recent PEFT versions, adapter_model.bin in older ones)
print("LoRA weights saved")

import os
for f in os.listdir('./lora_weights'):
    size = os.path.getsize(f'./lora_weights/{f}') / 1e6
    print(f"  {f}: {size:.1f} MB")


Output:

LoRA weights saved
  adapter_config.json: 0.0 MB
  adapter_model.bin: 2.4 MB     <- only 2.4 MB instead of 500+ MB!

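
The same arithmetic explains why adapters for much larger models stay small. This sketch assumes LLaMA-7B-like dimensions (32 layers, hidden size 4096) with rank 8 on q_proj and v_proj only. These are illustrative assumptions, not measurements of a real checkpoint.

```python
n_layers, hidden, rank = 32, 4096, 8      # assumed LLaMA-7B-like dimensions
targets_per_layer = 2                     # q_proj and v_proj only

# Each adapted matrix gets A (rank x hidden) and B (hidden x rank)
params_per_target = rank * (hidden + hidden)
adapter_params = n_layers * targets_per_layer * params_per_target

adapter_mb = adapter_params * 4 / 1e6     # stored in fp32 (4 bytes each)
base_gb = 7_000_000_000 * 2 / 1e9         # ~7B base weights in fp16 (2 bytes each)

print(f"Adapter parameters: {adapter_params:,}")
print(f"Adapter checkpoint: ~{adapter_mb:.1f} MB")
print(f"Base model weights: ~{base_gb:.0f} GB")
```

Under these assumptions you can keep one copy of the base model and ship a separate adapter of a few megabytes per task.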

# Load: start with base model, then load LoRA adapter
base_model_for_load = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3
)
loaded_lora_model = PeftModel.from_pretrained(base_model_for_load, './lora_weights')
loaded_lora_model.eval()
print("LoRA model loaded successfully")



Merging LoRA for Deployment

After training, you can merge the LoRA weights into the base model. Then you have one clean model with no overhead at inference time.

# Merge LoRA into base model
merged_model = model.merge_and_unload()

# Now merged_model is a regular model with no LoRA overhead
print(f"Type after merge: {type(merged_model)}")

# Save the merged model
merged_model.save_pretrained('./merged_model')
tokenizer.save_pretrained('./merged_model')

# Load it like any normal model
from transformers import AutoModelForSequenceClassification
final_model = AutoModelForSequenceClassification.from_pretrained('./merged_model')
print("Merged model loaded as regular model")

# Check: no LoRA parameters, just the full model
n_params = sum(p.numel() for p in final_model.parameters())
print(f"Parameters: {n_params:,}")

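
Conceptually, merging folds the scaled product of B and A into the frozen weight matrix. The sketch below checks with NumPy that a merged matrix gives the same output as the two-branch forward pass. It illustrates the underlying algebra with made-up sizes and is not PEFT's actual implementation of merge_and_unload.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 12, 4, 8
scaling = alpha / r

W = rng.normal(size=(d_out, d_in))   # frozen pretrained weights
A = rng.normal(size=(r, d_in))       # "trained" LoRA matrices (random here)
B = rng.normal(size=(d_out, r))
x = rng.normal(size=(3, d_in))

# Unmerged: frozen branch plus LoRA branch
unmerged = x @ W.T + (x @ A.T @ B.T) * scaling

# Merged: fold the update into a single weight matrix
W_merged = W + scaling * (B @ A)
merged = x @ W_merged.T

max_diff = float(np.abs(unmerged - merged).max())
print(f"Max difference between merged and unmerged outputs: {max_diff:.2e}")
```

The two results agree up to floating-point rounding, so a merged model needs only one matrix multiply per layer at inference.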


QLoRA: 4-bit Quantization + LoRA

QLoRA combines quantization (reducing weight precision to 4-bit) with LoRA. This lets you fine-tune 7B+ models on a single consumer GPU.
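
To get a feel for what 4-bit storage does to weights, here is a deliberately simplified sketch using uniform block-wise absmax quantization in NumPy. This is not the NF4 scheme that bitsandbytes uses, and the function names and block size are assumptions. The principle is the same, though: store small integers plus one scale per block, then dequantize for compute.

```python
import numpy as np

def quantize_absmax_4bit(w, block_size=64):
    """Map each block of weights to integers in [-7, 7] with one scale per block."""
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                      # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales, shape):
    """Recover approximate float weights from integers and per-block scales."""
    return (q.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(128, 64)).astype(np.float32)

q, scales = quantize_absmax_4bit(w)
w_hat = dequantize(q, scales, w.shape)

max_err = float(np.abs(w - w_hat).max())
print(f"Integer range: [{q.min()}, {q.max()}]")
print(f"Max reconstruction error: {max_err:.5f}")
```

The reconstruction error is bounded by half a quantization step per block. In QLoRA, the quantized base weights stay frozen while the LoRA adapters train in higher precision on top of them.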

pip install bitsandbytes


from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # quantize to 4-bit
    bnb_4bit_quant_type='nf4',            # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16, # compute in fp16
    bnb_4bit_use_double_quant=True        # double quantization (saves more memory)
)

# Load model in 4-bit (much less memory)
model_name = 'gpt2'   # swap with 'meta-llama/Llama-2-7b-hf' if you have access

qlora_base = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map='auto'          # automatically handles multi-GPU or CPU offload
)

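# Disable the KV cache during training (it is only useful for generation)
qlora_base.config.use_cache           = False
# Only meaningful for LLaMA-2 checkpoints; harmless for GPT-2
qlora_base.config.pretraining_tp      = 1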

# Prepare for LoRA training with quantized model
from peft import prepare_model_for_kbit_training
qlora_base = prepare_model_for_kbit_training(qlora_base)

# Apply LoRA config
qlora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=['c_attn'],  # GPT-2's fused QKV projection; adding 'c_proj' would also adapt the attention and MLP output projections
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)

qlora_model = get_peft_model(qlora_base, qlora_config)
qlora_model.print_trainable_parameters()


Output:

trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364


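# Memory savings with QLoRA: approximate size of the weights of a 7B model.
# Training adds gradients, optimizer states, and activations on top:
# for every weight in full fine-tuning, but only for the small adapters in LoRA/QLoRA.
memory_estimates = {
    'Full fine-tuning (fp32)':     '~28 GB of weights (~168 GB total by the estimate above)',
    'Full fine-tuning (fp16)':     '~14 GB of weights, plus gradients and optimizer states',
    'LoRA (fp16 base)':            '~14 GB of frozen weights, plus small adapter state',
    'QLoRA (4-bit base + LoRA)':   '~3.5 GB of frozen weights, plus small adapter state',
}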

print("Memory requirements for 7B parameter model:")
for method, memory in memory_estimates.items():
    print(f"  {method:<35}: {memory}")



LoRA vs Full Fine-Tuning: Benchmark Comparison

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer, TrainingArguments, Trainer,
    DataCollatorWithPadding
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
import evaluate, numpy as np, time, torch

model_name = 'distilbert-base-uncased'
tokenizer  = AutoTokenizer.from_pretrained(model_name)

# IMDB is sorted by label, so shuffle before taking a subset
dataset    = load_dataset('imdb')
small_train = dataset['train'].shuffle(seed=42).select(range(1000))
small_val   = dataset['test'].shuffle(seed=42).select(range(300))

def tokenize(examples):
    return tokenizer(examples['text'], truncation=True, max_length=128)

train_ds = small_train.map(tokenize, batched=True, remove_columns=['text'])
val_ds   = small_val.map(tokenize,   batched=True, remove_columns=['text'])
train_ds = train_ds.rename_column('label', 'labels')
val_ds   = val_ds.rename_column('label', 'labels')

accuracy = evaluate.load('accuracy')
def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return accuracy.compute(predictions=preds, references=eval_pred.label_ids)

def run_experiment(use_lora, rank=8):
    base = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    if use_lora:
        config = LoraConfig(
            task_type=TaskType.SEQ_CLS, r=rank,
            lora_alpha=rank*2, lora_dropout=0.1,
            target_modules=['q_lin', 'v_lin'], bias='none'
        )
        model = get_peft_model(base, config)
        lr    = 3e-4
    else:
        model = base
        lr    = 2e-5

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total     = sum(p.numel() for p in model.parameters())

    args = TrainingArguments(
        output_dir=f'./exp_{"lora" if use_lora else "full"}',
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=lr,
        evaluation_strategy='epoch',   # renamed to eval_strategy in recent transformers releases
        report_to='none',
        logging_steps=999
    )

    trainer = Trainer(
        model=model, args=args,
        train_dataset=train_ds, eval_dataset=val_ds,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer),
        compute_metrics=compute_metrics
    )

    start   = time.time()
    trainer.train()
    elapsed = time.time() - start
    results = trainer.evaluate()

    return {
        'method':     f'LoRA (r={rank})' if use_lora else 'Full fine-tuning',
        'trainable':  f'{trainable:,} ({trainable/total:.1%})',
        'accuracy':   f"{results['eval_accuracy']:.3f}",
        'time_s':     f"{elapsed:.0f}s"
    }

print("Running comparison (this takes a few minutes)...")
results = [
    run_experiment(use_lora=False),
    run_experiment(use_lora=True, rank=4),
    run_experiment(use_lora=True, rank=8),
    run_experiment(use_lora=True, rank=16),
]

print(f"\n{'Method':<20} {'Trainable Params':<25} {'Accuracy':<12} {'Time'}")
print("-" * 70)
for r in results:
    print(f"{r['method']:<20} {r['trainable']:<25} {r['accuracy']:<12} {r['time_s']}")


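Typical output (illustrative; exact parameter counts, accuracy, and timings vary with library versions and hardware):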

Method               Trainable Params          Accuracy     Time
----------------------------------------------------------------------
Full fine-tuning     66,955,010 (100%)         0.934        148s
LoRA (r=4)           147,968 (0.22%)           0.921        102s
LoRA (r=8)           295,168 (0.44%)           0.928        108s
LoRA (r=16)          589,824 (0.88%)           0.931        115s


LoRA with r=8 gets 99.4% of full fine-tuning accuracy with 0.44% of the parameters and 73% of the training time. For larger models, the savings are even more dramatic.


When to Use LoRA vs Full Fine-Tuning

Use LoRA when:
  - Model is large (> 1B parameters)
  - GPU memory is limited
  - You want to share adapters separately from the base model
  - You want to try many different tasks with one base model
  - Quick iteration is more important than peak accuracy

Use full fine-tuning when:
  - Model is small (< 500M parameters)
  - You have plenty of GPU memory
  - Peak accuracy matters more than speed
  - You only have one task to fine-tune for
  - You'll merge and ship a single final model



Quick Cheat Sheet

| Concept | What it means |
|---|---|
| Rank (r) | Dimensions of LoRA matrices. r=8 is a good default. |
| Alpha (α) | Scaling. Set to 2*r or same as r. |
| Target modules | Which weight matrices to apply LoRA to. Start with Q and V. |
| Scaling factor | alpha/rank. Controls LoRA strength. |
| Merge and unload | Bake LoRA into base weights. One clean model for deployment. |
| QLoRA | 4-bit quantization + LoRA. Fine-tune 7B+ models on a single consumer GPU. |

| Task | Code |
|---|---|
| Configure LoRA | LoraConfig(r=8, lora_alpha=16, target_modules=[...]) |
| Apply to model | get_peft_model(base_model, lora_config) |
| Check params | model.print_trainable_parameters() |
| Save adapters | model.save_pretrained('./lora_weights') |
| Load adapters | PeftModel.from_pretrained(base_model, './lora_weights') |
| Merge weights | model.merge_and_unload() |
| QLoRA setup | BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4') |

Practice Challenges

Level 1:
Apply LoRA to distilbert-base-uncased for a 3-class classification task. Use r=4 then r=16. Print the trainable parameter counts for both. Fine-tune each for 2 epochs. Compare accuracy vs parameter count.

Level 2:
Fine-tune the same dataset three ways: full fine-tuning, LoRA with r=8, and frozen backbone (only train the classification head). Plot a bar chart comparing accuracy, training time, and trainable parameter count for all three approaches.

Level 3:
Set up QLoRA with bitsandbytes on any GPT-style model. Verify it loads in 4-bit. Fine-tune on a small instruction dataset for 1 epoch. Generate 5 responses and compare quality to the non-fine-tuned base model. Report GPU memory usage before and after loading.




Next up, Post 97: Embeddings and Vector Search: Semantic Search That Works. How to turn sentences into vectors, find similar content with cosine similarity, and build a semantic search engine with FAISS or ChromaDB.