95. Fine-Tuning LLMs: Make a General Model Do Your Specific Job
Akhilesh · 2026-05-23 · via DEV Community

A general language model knows a little about everything.

It knows some medicine. Some law. Some code. Some cooking. But it doesn't know your specific domain deeply. It doesn't know your company's tone, your product's terminology, or your task's format.

Fine-tuning fixes this. You take a pretrained model that already understands language and specialize it for your specific task with a fraction of the data and compute you'd need to train from scratch.

This post covers how to do it properly.


What You'll Learn Here

  • What fine-tuning actually does to a pretrained model
  • The three types of fine-tuning and when to use each
  • Preparing datasets for instruction fine-tuning
  • Full fine-tuning with the HuggingFace Trainer
  • Evaluating fine-tuned models properly
  • Catastrophic forgetting and how to avoid it
  • Tips that actually make a difference

What Fine-Tuning Does

A pretrained LLM has learned a general representation of language from billions of tokens. Its weights encode grammar, facts, reasoning patterns, and world knowledge.

Fine-tuning continues training on a smaller, task-specific dataset. The model adapts its weights slightly to specialize. The key word is slightly. You don't want to destroy the general knowledge. You want to build on it.

Pretrained model:
  - Knows language deeply
  - Broad but shallow domain knowledge
  - No concept of your task format

After fine-tuning:
  - Still knows language
  - Deep knowledge of your domain
  - Understands your task format
  - Responds in your required style


The weights change. But not completely. A well-fine-tuned model retains its general capabilities while gaining task-specific expertise.
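
One way to put a number on "slightly" is to compare the weights before and after fine-tuning. The sketch below is only an illustration: `relative_drift` is a hypothetical helper, and the two "state dicts" are made-up toy values rather than real model weights.

```python
import math

def relative_drift(pretrained, finetuned):
    """Return ||finetuned - pretrained|| / ||pretrained|| over all weights."""
    diff_sq = 0.0
    base_sq = 0.0
    for name, base_values in pretrained.items():
        for base, tuned in zip(base_values, finetuned[name]):
            diff_sq += (tuned - base) ** 2
            base_sq += base ** 2
    return math.sqrt(diff_sq) / math.sqrt(base_sq)

# Toy "state dicts": parameter name -> flat list of weights (made-up numbers)
pretrained = {"layer.weight": [3.0, 4.0], "layer.bias": [0.0, 0.0]}
finetuned  = {"layer.weight": [3.0, 4.3], "layer.bias": [0.4, 0.0]}

drift = relative_drift(pretrained, finetuned)
print(f"Relative drift: {drift:.1%}")
```

A small ratio suggests the model stayed close to its pretrained state; a large one is a hint to check whether general capabilities have degraded.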


Three Types of Fine-Tuning

Type 1: Full Fine-Tuning
Update all weights. Best results. Expensive. Needs lots of data. Risk of catastrophic forgetting.

Type 2: Feature Extraction (Frozen backbone)
Freeze the pretrained model. Only train a new head (classification layer, etc.). Fast. Needs very little data. Limited adaptation.

Type 3: Parameter-Efficient Fine-Tuning (LoRA, adapters)
Add small trainable modules. Freeze most of the model. Train only a tiny fraction of parameters. Best of both worlds. Covered deeply in Post 96.

# Type 1: Full fine-tuning
for param in model.parameters():
    param.requires_grad = True   # all params update

# Type 2: Frozen backbone
for param in model.base_model.parameters():
    param.requires_grad = False  # freeze backbone
# only classifier head trains

# Type 3: LoRA (simplified)
# Covered in Post 96
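
To get a feel for why Type 3 trains "only a tiny fraction of parameters", here is a rough parameter count for a single weight matrix. It assumes the standard LoRA setup (two low-rank factors added next to a frozen matrix); the 768 hidden size matches DistilBERT-base, while the rank of 8 and the helper name `lora_param_counts` are illustrative choices, not values from this post.

```python
def lora_param_counts(d_out, d_in, rank):
    """Compare a full weight matrix with a rank-`rank` LoRA adapter for it."""
    full = d_out * d_in            # parameters in the frozen weight matrix W
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora

full, lora = lora_param_counts(d_out=768, d_in=768, rank=8)
print(f"Full matrix:  {full:,}")
print(f"LoRA adapter: {lora:,} ({lora / full:.2%} of the full matrix)")
```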



Dataset Preparation

Good data beats a good model almost every time. This is where most fine-tuning projects live or die.

For classification fine-tuning:

from datasets import Dataset, DatasetDict
import pandas as pd

# Your labeled data
data = {
    'text': [
        "The patient presented with acute chest pain radiating to the left arm.",
        "The quarterly earnings exceeded analyst expectations by 15%.",
        "The defendant claims he was not present at the scene of the crime.",
        "Treatment with metformin reduced HbA1c levels significantly.",
        "Revenue growth was driven by strong performance in cloud services.",
        "The prosecution presented DNA evidence linking the suspect to the crime.",
        "MRI results showed no signs of cerebral hemorrhage.",
        "Operating margins expanded by 200 basis points year over year.",
        "The jury found the defendant not guilty on all counts.",
        "The patient was discharged after a three-day hospitalization.",
    ],
    'label': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]  # 0=medical, 1=finance, 2=legal
}

df = pd.DataFrame(data)

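# Train/val split
# A stratified split needs at least one validation example per class.
# With 3 classes and only 10 rows, test_size=0.2 would leave 2 validation rows
# and raise a ValueError, so this toy example holds out 30% instead.
from sklearn.model_selection import train_test_split
train_df, val_df = train_test_split(df, test_size=0.3, random_state=42, stratify=df['label'])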

train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
val_dataset   = Dataset.from_pandas(val_df.reset_index(drop=True))

dataset = DatasetDict({'train': train_dataset, 'validation': val_dataset})
print(dataset)


For instruction fine-tuning (making a model follow prompts):

# Instruction format used by most modern LLMs
def format_instruction(example):
    return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""

# Example instruction dataset
instruction_data = [
    {
        'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',
        'input': 'Patient reports persistent cough and shortness of breath for 3 weeks.',
        'output': 'symptom'
    },
    {
        'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',
        'input': 'Prescribed amoxicillin 500mg three times daily for 7 days.',
        'output': 'treatment'
    },
    {
        'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',
        'input': 'Confirmed diagnosis of type 2 diabetes mellitus based on HbA1c of 7.8%.',
        'output': 'diagnosis'
    },
]

for example in instruction_data:
    print(format_instruction(example))
    print("-" * 50)
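
Some instructions have no separate input, and real datasets are usually stored on disk rather than in a Python list. The sketch below shows one possible way to handle both; `format_instruction_flexible` and `to_jsonl` are hypothetical helpers, the second sample is made up, and the `{"text": ...}` record layout is an assumption rather than a required format.

```python
import json

def format_instruction_flexible(example):
    """Build a prompt, skipping the Input section when there is no input."""
    parts = [f"### Instruction:\n{example['instruction']}"]
    if example.get('input', '').strip():
        parts.append(f"### Input:\n{example['input']}")
    parts.append(f"### Response:\n{example['output']}")
    return "\n\n".join(parts)

def to_jsonl(examples):
    """Serialize formatted examples as JSON Lines, one {"text": ...} object per line."""
    return "\n".join(
        json.dumps({'text': format_instruction_flexible(ex)}) for ex in examples
    )

# Illustrative samples: the second one has no separate input
samples = [
    {'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',
     'input': 'Prescribed amoxicillin 500mg three times daily for 7 days.',
     'output': 'treatment'},
    {'instruction': 'Name one common symptom of influenza.', 'input': '', 'output': 'Fever'},
]

print(format_instruction_flexible(samples[1]))
print(to_jsonl(samples))
```

Keeping one JSON object per line makes it easy to append new examples and to stream large files later.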



Data Quality Checklist

Before fine-tuning, verify your data:

import pandas as pd
import numpy as np

def audit_dataset(df, text_col='text', label_col='label'):
    print("=" * 50)
    print("DATASET AUDIT REPORT")
    print("=" * 50)

    # Size
    print(f"\nTotal examples: {len(df):,}")

    # Class distribution (count once, then derive the percentages)
    print(f"\nClass distribution:")
    counts = df[label_col].value_counts()
    dist   = counts / counts.sum()
    for label, count in counts.items():
        print(f"  Class {label}: {count} ({dist[label]:.1%})")

    # Imbalance check
    max_class = dist.max()
    min_class = dist.min()
    ratio     = max_class / min_class
    if ratio > 5:
        print(f"  WARNING: Imbalance ratio {ratio:.1f}x. Consider oversampling or class weights.")

    # Text length
    lengths = df[text_col].str.len()
    print(f"\nText length:")
    print(f"  Min:    {lengths.min()}")
    print(f"  Max:    {lengths.max()}")
    print(f"  Median: {lengths.median():.0f}")
    print(f"  Mean:   {lengths.mean():.0f}")

    # Long texts warning
    # Roughly 4 characters per token is a common heuristic for English text,
    # so 512 * 4 characters approximates a 512-token limit.
    if lengths.max() > 512 * 4:
        print(f"  WARNING: Some texts may exceed token limits. Check truncation strategy.")

    # Duplicates
    n_dupes = df[text_col].duplicated().sum()
    if n_dupes > 0:
        print(f"\n  WARNING: {n_dupes} duplicate texts found. Remove before training.")

    # Missing values
    missing = df.isnull().sum().sum()
    if missing > 0:
        print(f"\n  WARNING: {missing} missing values found.")
    else:
        print(f"\nNo missing values.")

    print("=" * 50)

audit_dataset(pd.DataFrame(data))
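
The imbalance warning above suggests class weights. A minimal sketch of the usual inverse-frequency ("balanced") weighting is shown below; `balanced_class_weights` is a hypothetical helper, and the labels are the ten toy labels from the classification example.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in sorted(counts.items())
    }

labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]  # 0=medical, 1=finance, 2=legal
weights = balanced_class_weights(labels)
for label, weight in weights.items():
    print(f"Class {label}: weight {weight:.3f}")
```

Rarer classes get weights above 1 and common classes below 1, so a weighted loss pays more attention to the examples the model sees least often.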



Full Fine-Tuning for Sequence Classification

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    EarlyStoppingCallback
)
import evaluate
import numpy as np
import torch

model_name  = 'distilbert-base-uncased'
num_labels  = 3
label_names = ['medical', 'finance', 'legal']

id2label = {i: l for i, l in enumerate(label_names)}
label2id = {l: i for i, l in enumerate(label_names)}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding=False,       # DataCollator will pad dynamically
        max_length=256
    )

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val   = val_dataset.map(tokenize_function, batched=True)

# Model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id
)

# Metrics
accuracy = evaluate.load('accuracy')
f1_metric = evaluate.load('f1')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions    = np.argmax(logits, axis=-1)
    acc = accuracy.compute(predictions=predictions, references=labels)['accuracy']
    f1  = f1_metric.compute(
        predictions=predictions, references=labels, average='weighted'
    )['f1']
    return {'accuracy': acc, 'f1': f1}

# Training arguments
training_args = TrainingArguments(
    output_dir='./checkpoints/domain_classifier',

    # Training schedule
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,

    # Optimization
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,            # warmup for 10% of steps
    lr_scheduler_type='cosine',  # cosine decay after warmup

    # Evaluation
    evaluation_strategy='epoch',  # newer transformers releases rename this to eval_strategy
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    greater_is_better=True,

    # Logging
    logging_steps=10,
    logging_dir='./logs',
    report_to='none',

    # Efficiency
    fp16=torch.cuda.is_available(),  # mixed precision on GPU
    dataloader_num_workers=0,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,  # newer transformers releases name this argument processing_class
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

# Train
print("Starting fine-tuning...")
trainer.train()

# Evaluate
results = trainer.evaluate()
print(f"\nFinal Results:")
print(f"  Accuracy: {results['eval_accuracy']:.3f}")
print(f"  F1:       {results['eval_f1']:.3f}")
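
The `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` settings above describe a learning rate that ramps up linearly and then decays along a cosine curve. The sketch below reproduces that shape in plain Python so the numbers are easy to inspect. `lr_at_step` is a hypothetical helper and the 100-step run is an assumed example; the library's exact rounding of warmup steps may differ.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Inspect a few points of an assumed 100-step run
for step in (0, 5, 10, 55, 100):
    print(f"step {step:>3}: lr = {lr_at_step(step, total_steps=100):.2e}")
```

The peak is reached at the end of warmup (step 10 here), the rate is back to half of the peak midway through the decay, and it reaches zero on the final step.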



Evaluating a Fine-Tuned Model Properly

Accuracy alone isn't enough. Look at per-class performance, the confusion matrix, and the individual examples the model gets wrong.

from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import torch

# Get predictions on validation set
model.eval()
all_preds  = []
all_labels = []

val_dataloader = trainer.get_eval_dataloader()

with torch.no_grad():
    for batch in val_dataloader:
        batch   = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)
        preds   = torch.argmax(outputs.logits, dim=-1)

        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(batch['labels'].cpu().numpy())

# Classification report
print("Classification Report:")
print(classification_report(all_labels, all_preds, target_names=label_names))

# Confusion matrix
cm = confusion_matrix(all_labels, all_preds)
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=label_names, yticklabels=label_names)
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix - Fine-tuned DistilBERT')
plt.tight_layout()
plt.savefig('fine_tune_confusion.png', dpi=100)
plt.show()
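
To see why accuracy alone can mislead, it helps to compute the per-class numbers by hand once. The sketch below does that in plain Python; `per_class_metrics` is a hypothetical helper and the predictions are made-up toy values, not output from the model above.

```python
def per_class_metrics(y_true, y_pred, n_classes):
    """Compute precision, recall and F1 for each class from raw predictions."""
    report = {}
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {'precision': precision, 'recall': recall, 'f1': f1}
    return report

# Toy predictions: one class-1 example is misread as class 2
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_metrics(y_true, y_pred, n_classes=3)

print(f"Accuracy: {accuracy:.3f}")
for c, m in report.items():
    print(f"Class {c}: P={m['precision']:.2f} R={m['recall']:.2f} F1={m['f1']:.2f}")
```

Overall accuracy looks acceptable here, but class 1 is found only half of the time, which is exactly the kind of gap a single number hides.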


# Error analysis: look at what the model gets wrong
errors = []
texts  = val_df['text'].tolist()

for i, (pred, true) in enumerate(zip(all_preds, all_labels)):
    if pred != true:
        errors.append({
            'text':      texts[i],
            'true':      label_names[true],
            'predicted': label_names[pred]
        })

print(f"\nErrors ({len(errors)} out of {len(all_labels)}):")
for e in errors:
    print(f"\n  True: {e['true']}, Predicted: {e['predicted']}")
    print(f"  Text: '{e['text'][:80]}...'")


Error analysis is often the most valuable step. Understanding why the model gets specific examples wrong tells you what data to add next.
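
When there are more than a handful of errors, grouping them by (true, predicted) pair shows which confusions dominate. A minimal sketch, assuming error records shaped like the `errors` list above; `top_confusions` is a hypothetical helper and the sample records are made up.

```python
from collections import Counter

def top_confusions(errors, k=3):
    """Return the k most frequent (true, predicted) label pairs."""
    pairs = Counter((e['true'], e['predicted']) for e in errors)
    return pairs.most_common(k)

# Made-up error records for illustration
sample_errors = [
    {'text': 'The settlement cost the firm $2M.',       'true': 'legal',   'predicted': 'finance'},
    {'text': 'The court approved the merger terms.',    'true': 'legal',   'predicted': 'finance'},
    {'text': 'Insurers denied the treatment claim.',    'true': 'medical', 'predicted': 'finance'},
]

for (true, predicted), count in top_confusions(sample_errors):
    print(f"{true} -> {predicted}: {count}")
```

If one pair dominates, collecting more examples near that boundary is usually a better use of labeling time than adding data uniformly.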


Catastrophic Forgetting: The Real Risk

When you fine-tune on a small dataset, the model can forget what it learned during pretraining. Weights move too far from their pretrained values. General capabilities degrade.

# Signs of catastrophic forgetting:
# 1. Model performs well on your task but fails on general text
# 2. Perplexity on general text spikes
# 3. Model generates incoherent text outside your domain

# Prevent it with:

# 1. Low learning rate (2e-5 is usually safe for BERT-based models)
training_args_safe = TrainingArguments(
    learning_rate=2e-5,        # not 1e-3 or 1e-4
    weight_decay=0.01,         # L2 regularization
    warmup_ratio=0.1,
    num_train_epochs=3,        # not 50
    output_dir='./safe_ft'
)

# 2. Freeze early layers (they contain general language knowledge)
def freeze_early_layers(model, n_frozen_layers=4):
    # Freeze embedding layers
    for param in model.distilbert.embeddings.parameters():
        param.requires_grad = False

    # Freeze first n transformer layers
    for layer in model.distilbert.transformer.layer[:n_frozen_layers]:
        for param in layer.parameters():
            param.requires_grad = False

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total     = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({trainable/total:.1%})")

freeze_early_layers(model, n_frozen_layers=4)

# 3. Small dataset? Consider LoRA (Post 96) instead of full fine-tuning
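
The second sign in the list above, a perplexity spike on general text, is easy to check once you have per-token losses. Perplexity is the exponential of the mean negative log-likelihood. In the sketch below, `perplexity` is a hypothetical helper and both loss lists are made-up numbers standing in for losses measured on the same held-out general text before and after fine-tuning.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Made-up per-token losses on the same general-domain text
nll_before = [2.9, 3.1, 3.0, 3.0]   # pretrained model
nll_after  = [4.4, 4.6, 4.5, 4.5]   # after aggressive fine-tuning

ppl_before = perplexity(nll_before)
ppl_after  = perplexity(nll_after)

print(f"Perplexity before: {ppl_before:.1f}")
print(f"Perplexity after:  {ppl_after:.1f}")
print(f"Ratio: {ppl_after / ppl_before:.1f}x")
```

A modest rise is expected after specialization; a jump of several times, as in these toy numbers, suggests the weights have moved too far.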



Instruction Fine-Tuning a Generative Model

For causal LLMs (GPT-style), you format the data as prompts and completions.

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import Dataset
import torch

# Load a small generative model
model_name = 'gpt2'
tokenizer  = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)
model.config.use_cache = False   # the KV cache only helps generation; it must be off if you enable gradient checkpointing

# Instruction dataset
instructions = [
    {
        'prompt': "### Instruction:\nSummarize this in one sentence.\n\n### Input:\nMachine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. It uses algorithms to parse data, learn from it, and make informed decisions.\n\n### Response:\n",
        'completion': "Machine learning allows computers to learn from data and make decisions without explicit programming."
    },
    {
        'prompt': "### Instruction:\nSummarize this in one sentence.\n\n### Input:\nThe Eiffel Tower, located in Paris, France, was built between 1887 and 1889 as the entrance arch for the 1889 World's Fair and stands 330 meters tall.\n\n### Response:\n",
        'completion': "The Eiffel Tower is a 330-meter structure in Paris built in 1889 as the entrance arch for the World's Fair."
    },
]

# Tokenize: concatenate prompt + completion, mask prompt in loss
def tokenize_instruction(example, max_length=256):
    full_text = example['prompt'] + example['completion'] + tokenizer.eos_token

    tokenized = tokenizer(
        full_text,
        max_length=max_length,
        truncation=True,
        padding='max_length',
        return_tensors='pt'
    )

    input_ids  = tokenized['input_ids'][0]
    labels     = input_ids.clone()

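    # Mask the prompt tokens in loss (we only want to train on completions)
    prompt_ids = tokenizer(example['prompt'], return_tensors='pt')['input_ids'][0]
    prompt_len = len(prompt_ids)
    labels[:prompt_len] = -100   # -100 is ignored in CrossEntropyLoss

    # Also mask padding. The pad token is the EOS token here, so without this
    # the model would be trained to emit EOS at every padded position.
    # The real EOS after the completion has attention_mask == 1 and is kept.
    labels[tokenized['attention_mask'][0] == 0] = -100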

    return {
        'input_ids':      input_ids,
        'attention_mask': tokenized['attention_mask'][0],
        'labels':         labels
    }

tokenized_data = [tokenize_instruction(ex) for ex in instructions]

# Wrap the tokenized examples in a torch Dataset (torch is already imported above)

class InstructionDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

train_ds = InstructionDataset(tokenized_data)

# Fine-tune
training_args = TrainingArguments(
    output_dir='./instruct_model',
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch size = 4
    learning_rate=2e-5,
    warmup_steps=10,
    logging_steps=5,
    save_steps=50,
    report_to='none',
    fp16=torch.cuda.is_available()
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
)

trainer.train()
print("Instruction fine-tuning complete")
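
The label masking is the part of this pipeline that is easiest to get wrong, so here is the same idea on plain lists, with no tokenizer involved. `build_labels` is a hypothetical helper and the token IDs are made up, except 50256, which is GPT-2's end-of-text ID.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_labels(input_ids, attention_mask, prompt_len):
    """Copy input_ids into labels, hiding prompt and padding positions."""
    labels = []
    for pos, (token_id, mask) in enumerate(zip(input_ids, attention_mask)):
        if pos < prompt_len or mask == 0:
            labels.append(IGNORE_INDEX)   # prompt token or padding
        else:
            labels.append(token_id)       # completion token (including real EOS)
    return labels

# 3 prompt tokens, 2 completion tokens, 1 real EOS, 2 padding positions
input_ids      = [11, 12, 13, 21, 22, 50256, 50256, 50256]
attention_mask = [1,  1,  1,  1,  1,  1,     0,     0]

labels = build_labels(input_ids, attention_mask, prompt_len=3)
print(labels)
```

Only the completion tokens and the first EOS contribute to the loss, so the model learns what to answer and when to stop, not how to repeat the prompt or the padding.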



Testing Your Fine-Tuned Model

# Test the fine-tuned generative model
model.eval()

def generate_response(prompt, max_new_tokens=100, temperature=0.7):
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    generated = output[0][inputs['input_ids'].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True)

# Test prompt
test_prompt = """### Instruction:
Summarize this in one sentence.

### Input:
Neural networks are computing systems inspired by biological neural networks. They consist of layers of interconnected nodes that process information using connectionist approaches to computation.

### Response:
"""

response = generate_response(test_prompt)
print(f"Generated response:\n{response}")
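
The `temperature=0.7` argument above rescales the logits before sampling. The sketch below shows the effect on a toy distribution; `softmax_with_temperature` is a hypothetical helper and the three logits are made up.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

for t in (1.0, 0.7, 0.2):
    probs = softmax_with_temperature(logits, temperature=t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Lower temperatures push probability toward the top token, which gives more repeatable summaries; higher temperatures spread it out and give more varied output.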



Fine-Tuning Best Practices

# Summary of what actually works

best_practices = {
    'learning_rate': {
        'BERT-based (classification)': '2e-5 to 5e-5',
        'GPT-based (generation)':      '1e-5 to 3e-5',
        'Frozen backbone':             '1e-4 to 1e-3 for head only'
    },
    'batch_size': {
        'recommendation': '16 or 32 if memory allows',
        'small GPU':      'batch=4 + gradient_accumulation=4'
    },
    'epochs': {
        'BERT classification': '2 to 4',
        'GPT generation':      '1 to 3',
        'note':                'More epochs = more overfitting risk'
    },
    'data_size': {
        'frozen backbone':  'Works with 100+ examples',
        'full fine-tuning': 'Need 1000+ for reliable results',
        'instruction FT':   '1000 to 10000 good examples'
    },
    'stopping': {
        'recommendation': 'Always use early stopping',
        'metric':         'Monitor validation loss, not training loss'
    }
}

for category, details in best_practices.items():
    print(f"\n{category.upper()}:")
    for k, v in details.items():
        print(f"  {k}: {v}")
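
The early stopping recommendation boils down to a patience counter. The sketch below shows that logic on its own, assuming a metric where higher is better; `early_stopping_epoch` is a hypothetical helper and the F1 values are made up. It is meant to illustrate the idea, not to mirror any library's internals line by line.

```python
def early_stopping_epoch(val_scores, patience=2, greater_is_better=True):
    """Return (epoch at which training stops, best score seen so far)."""
    best = None
    evals_without_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if best is None:
            improved = True
        elif greater_is_better:
            improved = score > best
        else:
            improved = score < best

        if improved:
            best = score
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return epoch, best
    return len(val_scores), best

# Made-up validation F1 per epoch
f1_by_epoch = [0.71, 0.78, 0.80, 0.79, 0.80, 0.77]
stop_epoch, best_f1 = early_stopping_epoch(f1_by_epoch, patience=2)
print(f"Stopped after epoch {stop_epoch}, best F1 = {best_f1}")
```

Note that matching the previous best does not count as an improvement here, so two flat or worse evaluations in a row end training.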



Quick Cheat Sheet

Decision | Guidance
How much data do I have? | < 500: freeze backbone. 500-5k: full fine-tune. > 5k: great
Which model to start with? | DistilBERT for speed, RoBERTa for accuracy
Learning rate | 2e-5 for BERT, 1e-5 for GPT, never > 5e-5 when the backbone is trainable
Epochs | 2-4, use early stopping
Catastrophic forgetting | Lower LR, freeze early layers, fewer epochs
Model not learning | Raise LR, check data quality, check label correctness
Model overfitting | Lower LR, add dropout, add more data, use LoRA

Task | Code
Load model | AutoModelForSequenceClassification.from_pretrained(name, num_labels=N)
Tokenize | tokenizer(texts, truncation=True, padding=False, max_length=256)
Train | Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds).train()
Early stop | EarlyStoppingCallback(early_stopping_patience=2)
Save | trainer.save_model('./my_model')
Predict | trainer.predict(test_dataset)

Practice Challenges

Level 1:
Download any small labeled text dataset from the HuggingFace hub. Fine-tune distilbert-base-uncased on it for 3 epochs. Print the classification report. Compare to a TF-IDF + LogisticRegression baseline.

Level 2:
Fine-tune with and without freezing the first 4 transformer layers. Compare final F1 scores and training time. Which approach is better for your dataset size?

Level 3:
Create your own instruction dataset of 50+ examples for a specific task (code explanation, medical text classification, legal summarization). Fine-tune GPT-2 on it. Test the model with 10 new prompts it hasn't seen. Rate the responses 1-5 and report average quality.



Next up, Post 96: LoRA: Fine-Tune a Billion-Parameter Model on a Laptop. Parameter-efficient fine-tuning using rank decomposition. Train 1% of parameters and get 95% of the performance of full fine-tuning.