A general language model knows a little about everything.
It knows some medicine. Some law. Some code. Some cooking. But it doesn't know your specific domain deeply. It doesn't know your company's tone, your product's terminology, or your task's format.
Fine-tuning fixes this. You take a pretrained model that already understands language and specialize it for your specific task with a fraction of the data and compute you'd need to train from scratch.
This post covers how to do it properly.
What You'll Learn Here
- What fine-tuning actually does to a pretrained model
- The three types of fine-tuning and when to use each
- Preparing datasets for instruction fine-tuning
- Full fine-tuning with the HuggingFace Trainer
- Evaluating fine-tuned models properly
- Catastrophic forgetting and how to avoid it
- Tips that actually make a difference
What Fine-Tuning Does
A pretrained LLM has learned a general representation of language from billions of tokens. Its weights encode grammar, facts, reasoning patterns, and world knowledge.
Fine-tuning continues training on a smaller, task-specific dataset. The model adapts its weights slightly to specialize. The key word is slightly. You don't want to destroy the general knowledge. You want to build on it.
Pretrained model:
- Knows language deeply
- Broad but shallow domain knowledge
- No concept of your task format
After fine-tuning:
- Still knows language
- Deep knowledge of your domain
- Understands your task format
- Responds in your required style
The weights change. But not completely. A well-fine-tuned model retains its general capabilities while gaining task-specific expertise.
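One way to make "slightly" concrete is to measure how far the weights moved relative to where they started. The sketch below is a minimal illustration on plain Python lists. `relative_drift` is a hypothetical helper, not part of any library, and the numbers are made up. In practice you would apply the same formula to each tensor in the model's state dict.

```python
import math

def relative_drift(pretrained, finetuned):
    """Return ||finetuned - pretrained|| / ||pretrained|| for two flat weight lists."""
    if len(pretrained) != len(finetuned):
        raise ValueError("weight lists must have the same length")
    base_norm = math.sqrt(sum(p ** 2 for p in pretrained))
    if base_norm == 0:
        raise ValueError("pretrained weights must not all be zero")
    delta_norm = math.sqrt(sum((f - p) ** 2 for p, f in zip(pretrained, finetuned)))
    return delta_norm / base_norm

# Toy numbers, invented for illustration only
pretrained = [0.50, -1.20, 0.80, 0.10]
gentle = [0.51, -1.19, 0.80, 0.11]       # small update: weights stay close
aggressive = [1.50, 0.30, -0.40, 0.90]   # large update: weights move far away

print(f"gentle:     {relative_drift(pretrained, gentle):.3f}")
print(f"aggressive: {relative_drift(pretrained, aggressive):.3f}")
```

A small ratio suggests the model is still close to its pretrained starting point. A large one is a hint, not proof, that general knowledge may have been overwritten.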
Three Types of Fine-Tuning
Type 1: Full Fine-Tuning
Update all weights. Usually the best task performance. Expensive. Needs more data than the other two options. Highest risk of catastrophic forgetting.
Type 2: Feature Extraction (Frozen backbone)
Freeze the pretrained model. Only train a new head (classification layer, etc.). Fast. Needs very little data. Limited adaptation.
Type 3: Parameter-Efficient Fine-Tuning (LoRA, adapters)
Add small trainable modules. Freeze most of the model. Train only a tiny fraction of parameters. Often close to full fine-tuning quality at a fraction of the memory cost. Covered deeply in Post 96.
# Type 1: Full fine-tuning
for param in model.parameters():
    param.requires_grad = True  # all params update

# Type 2: Frozen backbone
for param in model.base_model.parameters():
    param.requires_grad = False  # freeze backbone
# only the classifier head trains

# Type 3: LoRA (simplified)
# Covered in Post 96
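To get a feel for how small "a tiny fraction of parameters" is, here is a back-of-the-envelope sketch for a single weight matrix. It assumes the usual low-rank setup, where a frozen `d_out x d_in` matrix gets a trainable update made of two thin matrices of rank `r`. `lora_trainable_fraction` is a hypothetical helper, 768 is DistilBERT's hidden size, and rank 8 is an arbitrary choice for illustration.

```python
def lora_trainable_fraction(d_in, d_out, rank):
    """Fraction of parameters trained when a d_out x d_in matrix is frozen
    and a rank-`rank` update (two thin matrices) is trained instead."""
    full = d_in * d_out
    low_rank = rank * (d_in + d_out)  # one matrix is rank x d_in, the other d_out x rank
    return low_rank / full

frac = lora_trainable_fraction(768, 768, rank=8)
print(f"Trainable share of this matrix: {frac:.2%}")
```

The share shrinks further as the matrix gets larger, which is why the approach scales well to big models.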
Dataset Preparation
Good data beats a good model almost every time. This is where most fine-tuning projects live or die.
For classification fine-tuning:
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
import pandas as pd

# Your labeled data
data = {
    'text': [
        "The patient presented with acute chest pain radiating to the left arm.",
        "The quarterly earnings exceeded analyst expectations by 15%.",
        "The defendant claims he was not present at the scene of the crime.",
        "Treatment with metformin reduced HbA1c levels significantly.",
        "Revenue growth was driven by strong performance in cloud services.",
        "The prosecution presented DNA evidence linking the suspect to the crime.",
        "MRI results showed no signs of cerebral hemorrhage.",
        "Operating margins expanded by 200 basis points year over year.",
        "The jury found the defendant not guilty on all counts.",
        "The patient was discharged after a three-day hospitalization.",
    ],
    'label': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]  # 0=medical, 1=finance, 2=legal
}
df = pd.DataFrame(data)

# Train/val split
# A stratified split needs at least one validation example per class.
# With 10 examples and 3 classes, test_size=0.2 (2 examples) raises a ValueError,
# so this toy example uses 0.3 (3 examples).
train_df, val_df = train_test_split(df, test_size=0.3, random_state=42, stratify=df['label'])
train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
val_dataset = Dataset.from_pandas(val_df.reset_index(drop=True))
dataset = DatasetDict({'train': train_dataset, 'validation': val_dataset})
print(dataset)
For instruction fine-tuning (making a model follow prompts):
# Alpaca-style instruction format (a common convention; chat models often use their own templates)
def format_instruction(example):
    return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
# Example instruction dataset
instruction_data = [
    {
        'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',
        'input': 'Patient reports persistent cough and shortness of breath for 3 weeks.',
        'output': 'symptom'
    },
    {
        'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',
        'input': 'Prescribed amoxicillin 500mg three times daily for 7 days.',
        'output': 'treatment'
    },
    {
        'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',
        'input': 'Confirmed diagnosis of type 2 diabetes mellitus based on HbA1c of 7.8%.',
        'output': 'diagnosis'
    },
]

for example in instruction_data:
    print(format_instruction(example))
    print("-" * 50)
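Many instruction datasets mix examples that have an input with examples that don't. A minimal sketch of one way to handle that, assuming you simply drop the Input section when it is empty: `format_instruction_flexible` and the two sample records are hypothetical, not part of any library or dataset.

```python
def format_instruction_flexible(example):
    """Build a prompt, omitting the '### Input:' section when the input is empty."""
    parts = [f"### Instruction:\n{example['instruction']}"]
    if example.get('input', '').strip():
        parts.append(f"### Input:\n{example['input']}")
    parts.append(f"### Response:\n{example['output']}")
    return "\n\n".join(parts)

# Hypothetical records for illustration
with_input = {
    'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',
    'input': 'Patient reports persistent cough for 3 weeks.',
    'output': 'symptom',
}
without_input = {
    'instruction': 'Name one common symptom of the flu.',
    'input': '',
    'output': 'Fever',
}

print(format_instruction_flexible(with_input))
print("-" * 50)
print(format_instruction_flexible(without_input))
```

Whatever template you pick, use exactly the same one at training time and at inference time.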
Data Quality Checklist
Before fine-tuning, verify your data:
import pandas as pd
import numpy as np

def audit_dataset(df, text_col='text', label_col='label'):
    print("=" * 50)
    print("DATASET AUDIT REPORT")
    print("=" * 50)

    # Size
    print(f"\nTotal examples: {len(df):,}")

    # Class distribution
    print(f"\nClass distribution:")
    dist = df[label_col].value_counts(normalize=True)
    for label, pct in dist.items():
        count = df[label_col].value_counts()[label]
        print(f" Class {label}: {count} ({pct:.1%})")

    # Imbalance check
    max_class = dist.max()
    min_class = dist.min()
    ratio = max_class / min_class
    if ratio > 5:
        print(f" WARNING: Imbalance ratio {ratio:.1f}x. Consider oversampling or class weights.")

    # Text length (in characters)
    lengths = df[text_col].str.len()
    print(f"\nText length:")
    print(f" Min: {lengths.min()}")
    print(f" Max: {lengths.max()}")
    print(f" Median: {lengths.median():.0f}")
    print(f" Mean: {lengths.mean():.0f}")

    # Long texts warning
    if lengths.max() > 512 * 4:  # rough estimate of 512 tokens at ~4 characters per token
        print(f" WARNING: Some texts may exceed token limits. Check truncation strategy.")

    # Duplicates
    n_dupes = df[text_col].duplicated().sum()
    if n_dupes > 0:
        print(f"\n WARNING: {n_dupes} duplicate texts found. Remove before training.")

    # Missing values
    missing = df.isnull().sum().sum()
    if missing > 0:
        print(f"\n WARNING: {missing} missing values found.")
    else:
        print(f"\nNo missing values.")
    print("=" * 50)

audit_dataset(pd.DataFrame(data))
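If the audit flags an imbalance, class weights are one of the two remedies the warning mentions. The sketch below computes inverse-frequency weights with the formula `n_samples / (n_classes * count)`, which is the same heuristic scikit-learn uses for its "balanced" option. `balanced_class_weights` is a hypothetical helper name. How you feed the weights into the loss depends on your training setup and is not shown here.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: rarer classes get larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Same label list as the toy classification dataset above
labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(f"Class {cls}: weight {weights[cls]:.3f}")
```

Class 0 appears four times and the others three times each, so class 0 gets a weight below 1 and the other two get a weight above 1.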
Full Fine-Tuning for Sequence Classification
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    EarlyStoppingCallback
)
import evaluate
import numpy as np
import torch

model_name = 'distilbert-base-uncased'
num_labels = 3
label_names = ['medical', 'finance', 'legal']
id2label = {i: l for i, l in enumerate(label_names)}
label2id = {l: i for i, l in enumerate(label_names)}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding=False,  # DataCollator will pad dynamically
        max_length=256
    )

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val = val_dataset.map(tokenize_function, batched=True)

# Model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id
)

# Metrics
accuracy = evaluate.load('accuracy')
f1_metric = evaluate.load('f1')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = accuracy.compute(predictions=predictions, references=labels)['accuracy']
    f1 = f1_metric.compute(
        predictions=predictions, references=labels, average='weighted'
    )['f1']
    return {'accuracy': acc, 'f1': f1}
# Training arguments
training_args = TrainingArguments(
    output_dir='./checkpoints/domain_classifier',
    # Training schedule
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    # Optimization
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,  # warmup for 10% of steps
    lr_scheduler_type='cosine',  # cosine decay after warmup
    # Evaluation
    # Note: recent transformers releases rename this argument to eval_strategy
    evaluation_strategy='epoch',
    save_strategy='epoch',  # must match the evaluation strategy for load_best_model_at_end
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    greater_is_better=True,
    # Logging
    logging_steps=10,
    logging_dir='./logs',
    report_to='none',
    # Efficiency
    fp16=torch.cuda.is_available(),  # mixed precision on GPU
    dataloader_num_workers=0,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,  # recent transformers releases prefer processing_class=tokenizer
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

# Train
print("Starting fine-tuning...")
trainer.train()

# Evaluate
results = trainer.evaluate()
print(f"\nFinal Results:")
print(f" Accuracy: {results['eval_accuracy']:.3f}")
print(f" F1: {results['eval_f1']:.3f}")
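The `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` settings describe a learning rate that ramps up linearly and then decays along a half cosine. The sketch below reproduces that shape in plain Python so you can see the numbers. It is an approximation for illustration: `lr_at_step` is a hypothetical helper, and the library may round the number of warmup steps differently.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 100
for step in (0, 5, 10, 55, 100):
    print(f"step {step:>3}: lr = {lr_at_step(step, total):.2e}")
```

The warmup phase keeps the first updates small, which matters most when the classifier head is freshly initialized and its early gradients are noisy.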
Evaluating a Fine-Tuned Model Properly
Accuracy alone isn't enough. Look at per-class performance, confusion matrix, and error cases.
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import torch

# Get predictions on validation set
model.eval()
all_preds = []
all_labels = []
val_dataloader = trainer.get_eval_dataloader()
with torch.no_grad():
    for batch in val_dataloader:
        batch = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)
        preds = torch.argmax(outputs.logits, dim=-1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(batch['labels'].cpu().numpy())

# Classification report
# Passing labels explicitly keeps the report working even if a class is missing
# from a small validation set.
label_ids = list(range(len(label_names)))
print("Classification Report:")
print(classification_report(all_labels, all_preds, labels=label_ids,
                            target_names=label_names, zero_division=0))

# Confusion matrix
cm = confusion_matrix(all_labels, all_preds, labels=label_ids)
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=label_names, yticklabels=label_names)
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix - Fine-tuned DistilBERT')
plt.tight_layout()
plt.savefig('fine_tune_confusion.png', dpi=100)
plt.show()

# Error analysis: look at what the model gets wrong
# The eval dataloader is not shuffled, so predictions line up with val_df rows.
errors = []
texts = val_df['text'].tolist()
for i, (pred, true) in enumerate(zip(all_preds, all_labels)):
    if pred != true:
        errors.append({
            'text': texts[i],
            'true': label_names[true],
            'predicted': label_names[pred]
        })

print(f"\nErrors ({len(errors)} out of {len(all_labels)}):")
for e in errors:
    print(f"\n True: {e['true']}, Predicted: {e['predicted']}")
    print(f" Text: '{e['text'][:80]}...'")
Error analysis is often the most valuable step. Understanding why the model gets specific examples wrong tells you what data to add next.
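To see why accuracy alone can mislead, it helps to derive per-class precision, recall, and F1 by hand from a confusion matrix. The sketch below does that in plain Python. The helper names and the matrix are hypothetical, invented for illustration, and are not output from the model above.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of examples with true class i predicted as class j."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, but true class differs
        fn = sum(cm[k]) - tp                       # true k, but predicted something else
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append({'precision': precision, 'recall': recall, 'f1': f1, 'support': tp + fn})
    return metrics

def macro_and_weighted_f1(metrics):
    """Macro treats every class equally; weighted scales each class by its support."""
    total = sum(m['support'] for m in metrics)
    macro = sum(m['f1'] for m in metrics) / len(metrics)
    weighted = sum(m['f1'] * m['support'] for m in metrics) / total
    return macro, weighted

# Made-up confusion matrix: rows are true classes, columns are predictions
cm = [[8, 1, 1],
      [0, 5, 0],
      [2, 0, 3]]
metrics = per_class_metrics(cm)
macro, weighted = macro_and_weighted_f1(metrics)
for name, m in zip(['medical', 'finance', 'legal'], metrics):
    print(f"{name}: P={m['precision']:.2f} R={m['recall']:.2f} F1={m['f1']:.2f}")
print(f"macro F1={macro:.3f}, weighted F1={weighted:.3f}")
```

In this toy matrix overall accuracy is 80%, yet the smallest class only reaches 60% recall. That gap is exactly what a per-class report surfaces.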
Catastrophic Forgetting: The Real Risk
When you fine-tune on a small dataset, the model can forget what it learned during pretraining. Weights move too far from their pretrained values. General capabilities degrade.
# Signs of catastrophic forgetting:
# 1. Model performs well on your task but fails on general text
# 2. Perplexity on general text spikes
# 3. Model generates incoherent text outside your domain

# Prevent it with:
# 1. Low learning rate (2e-5 is usually safe for BERT-based models)
training_args_safe = TrainingArguments(
    learning_rate=2e-5,   # not 1e-3 or 1e-4
    weight_decay=0.01,    # decoupled weight decay (AdamW)
    warmup_ratio=0.1,
    num_train_epochs=3,   # not 50
    output_dir='./safe_ft'
)

# 2. Freeze early layers (they contain general language knowledge)
# Attribute names below are specific to DistilBERT, which has 6 transformer layers.
def freeze_early_layers(model, n_frozen_layers=4):
    # Freeze embedding layers
    for param in model.distilbert.embeddings.parameters():
        param.requires_grad = False
    # Freeze first n transformer layers
    for layer in model.distilbert.transformer.layer[:n_frozen_layers]:
        for param in layer.parameters():
            param.requires_grad = False
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({trainable/total:.1%})")

freeze_early_layers(model, n_frozen_layers=4)

# 3. Use a small dataset? Consider LoRA (Post 96) instead of full fine-tuning
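The second warning sign, a perplexity spike on general text, is easy to check once you have per-token losses. Perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows the arithmetic only. `perplexity` is a hypothetical helper, and the loss values are invented rather than measured on a real model.

```python
import math

def perplexity(token_nlls):
    """exp(mean negative log-likelihood per token), using natural logs."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))

# Invented per-token losses on the same general-domain text
before_ft = [2.1, 1.8, 2.4, 2.0]   # pretrained model
after_ft = [3.9, 4.2, 3.6, 4.1]    # after an overly aggressive fine-tune

print(f"Perplexity before: {perplexity(before_ft):.1f}")
print(f"Perplexity after:  {perplexity(after_ft):.1f}")
```

Tracking this number on a fixed held-out sample of general text, before and after fine-tuning, gives you an early signal that the weights drifted too far.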
Instruction Fine-Tuning a Generative Model
For causal LLMs (GPT-style), you format the data as prompts and completions.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import Dataset
import torch
# Load a small generative model
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)
model.config.use_cache = False  # the KV cache is not needed for training (and conflicts with gradient checkpointing if you enable it)
# Instruction dataset
instructions = [
    {
        'prompt': "### Instruction:\nSummarize this in one sentence.\n\n### Input:\nMachine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. It uses algorithms to parse data, learn from it, and make informed decisions.\n\n### Response:\n",
        'completion': "Machine learning allows computers to learn from data and make decisions without explicit programming."
    },
    {
        'prompt': "### Instruction:\nSummarize this in one sentence.\n\n### Input:\nThe Eiffel Tower, located in Paris, France, was built between 1887 and 1889 as the entrance arch for the 1889 World's Fair and stands 330 meters tall.\n\n### Response:\n",
        'completion': "The Eiffel Tower is a 330-meter structure in Paris built in 1889 as the entrance arch for the World's Fair."
    },
]
# Tokenize: concatenate prompt + completion, mask prompt and padding in loss
def tokenize_instruction(example, max_length=256):
    full_text = example['prompt'] + example['completion'] + tokenizer.eos_token
    tokenized = tokenizer(
        full_text,
        max_length=max_length,
        truncation=True,
        padding='max_length',
        return_tensors='pt'
    )
    input_ids = tokenized['input_ids'][0]
    attention_mask = tokenized['attention_mask'][0]
    labels = input_ids.clone()
    # Mask the prompt tokens in loss (we only want to train on completions).
    # This assumes the prompt tokenizes to the same ids on its own as inside full_text.
    prompt_ids = tokenizer(example['prompt'], return_tensors='pt')['input_ids'][0]
    prompt_len = len(prompt_ids)
    labels[:prompt_len] = -100  # -100 is ignored in CrossEntropyLoss
    # Mask padding as well: pad_token is eos_token here, so without this
    # the model would be trained to predict every padded position.
    labels[attention_mask == 0] = -100
    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'labels': labels
    }

tokenized_data = [tokenize_instruction(ex) for ex in instructions]
# Convert to dataset
class InstructionDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

train_ds = InstructionDataset(tokenized_data)
# Fine-tune
training_args = TrainingArguments(
    output_dir='./instruct_model',
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size = 4
    learning_rate=2e-5,
    warmup_steps=10,
    logging_steps=5,
    save_steps=50,
    report_to='none',
    fp16=torch.cuda.is_available()
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
)
trainer.train()
print("Instruction fine-tuning complete")
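The label-masking step is the part of this pipeline that is easiest to get subtly wrong, so here it is isolated on plain Python lists. The token ids are invented, and `build_labels` is a hypothetical helper. The idea is that only completion tokens, plus the one real end-of-sequence token, keep their ids as labels. Prompt tokens and padding get -100 so the loss ignores them. Masking padding matters when the pad token is the same id as the EOS token, as in the GPT-2 setup above.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default

def build_labels(input_ids, attention_mask, prompt_len):
    """Copy input_ids, then hide prompt positions and padded positions from the loss."""
    labels = list(input_ids)
    for i in range(len(labels)):
        if i < prompt_len or attention_mask[i] == 0:
            labels[i] = IGNORE_INDEX
    return labels

# Invented ids: 3 prompt tokens, 2 completion tokens, 1 real EOS (99), 2 pad tokens (also 99)
input_ids = [11, 12, 13, 21, 22, 99, 99, 99]
attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]
labels = build_labels(input_ids, attention_mask, prompt_len=3)
print(labels)
```

The real EOS stays in the labels so the model learns where a response ends. The padded copies of it do not.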
Testing Your Fine-Tuned Model
# Test the fine-tuned generative model
model.eval()

def generate_response(prompt, max_new_tokens=100, temperature=0.7):
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    # Keep only the newly generated tokens, not the echoed prompt
    generated = output[0][inputs['input_ids'].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True)

# Test prompt (same layout as the training prompts, including blank lines)
test_prompt = """### Instruction:
Summarize this in one sentence.

### Input:
Neural networks are computing systems inspired by biological neural networks. They consist of layers of interconnected nodes that process information using connectionist approaches to computation.

### Response:
"""
response = generate_response(test_prompt)
print(f"Generated response:\n{response}")
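The `temperature` argument controls how peaked the sampling distribution is: logits are divided by the temperature before the softmax. A minimal sketch of that arithmetic, using invented logits and a hypothetical helper name, assuming plain temperature scaling with no top-k or top-p filtering:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Lower temperatures concentrate probability on the top token, which tends to suit summarization and classification-style prompts. Higher temperatures flatten the distribution and give more varied output.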
Fine-Tuning Best Practices
# Summary of what actually works
best_practices = {
    'learning_rate': {
        'BERT-based (classification)': '2e-5 to 5e-5',
        'GPT-based (generation)': '1e-5 to 3e-5',
        'Frozen backbone': '1e-3 to 1e-4 for head only'
    },
    'batch_size': {
        'recommendation': '16 or 32 if memory allows',
        'small GPU': 'batch=4 + gradient_accumulation=4'
    },
    'epochs': {
        'BERT classification': '2 to 4',
        'GPT generation': '1 to 3',
        'note': 'More epochs = more overfitting risk'
    },
    'data_size': {
        'frozen backbone': 'Works with 100+ examples',
        'full fine-tuning': 'Need 1000+ for reliable results',
        'instruction FT': '1000 to 10000 good examples'
    },
    'stopping': {
        'recommendation': 'Always use early stopping',
        'metric': 'Monitor validation loss, not training loss'
    }
}

for category, details in best_practices.items():
    print(f"\n{category.upper()}:")
    for k, v in details.items():
        print(f" {k}: {v}")
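Early stopping with a patience value boils down to a simple rule: stop once the monitored metric has failed to improve for `patience` evaluations in a row. The sketch below shows that rule on a list of per-epoch scores. It is a simplification written for illustration: `early_stopping_epoch` is a hypothetical helper, the scores are invented, and library callbacks typically add options such as a minimum improvement threshold.

```python
def early_stopping_epoch(val_scores, patience=2, greater_is_better=True):
    """Return the 0-based epoch at which training would stop, or None if it never does."""
    best = None
    bad_epochs = 0
    for epoch, score in enumerate(val_scores):
        if best is None:
            improved = True
        elif greater_is_better:
            improved = score > best
        else:
            improved = score < best
        if improved:
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

f1_per_epoch = [0.70, 0.78, 0.77, 0.76, 0.80]   # invented validation F1 scores
print(early_stopping_epoch(f1_per_epoch, patience=2))
```

With these scores, training stops after the fourth evaluation and never reaches the later 0.80. A larger patience would have caught it, at the cost of more compute, which is the trade-off the patience setting controls.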
Quick Cheat Sheet
| Decision | Guidance |
|---|---|
| How much data do I have? | < 500: freeze backbone. 500-5k: full fine-tune. > 5k: great |
| Which model to start with? | DistilBERT for speed, RoBERTa for accuracy |
| Learning rate | 2e-5 for BERT, 1e-5 for GPT; stay at or below 5e-5 whenever backbone weights update (a new head on a frozen backbone can use 1e-4 to 1e-3) |
| Epochs | 2-4, use early stopping |
| Catastrophic forgetting | Lower LR, freeze early layers, fewer epochs |
| Model not learning | Raise LR, check data quality, check label correctness |
| Model overfitting | Lower LR, add dropout, add more data, use LoRA |

| Task | Code |
|---|---|
| Load model | AutoModelForSequenceClassification.from_pretrained(name, num_labels=N) |
| Tokenize | tokenizer(texts, truncation=True, padding=False, max_length=256) |
| Train | Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds).train() |
| Early stop | EarlyStoppingCallback(early_stopping_patience=2) |
| Save | trainer.save_model('./my_model') |
| Predict | trainer.predict(test_dataset) |
Practice Challenges
Level 1:
Download any small labeled text dataset from the HuggingFace hub. Fine-tune distilbert-base-uncased on it for 3 epochs. Print the classification report. Compare to a TF-IDF + LogisticRegression baseline.
Level 2:
Fine-tune with and without freezing the first 4 transformer layers. Compare final F1 scores and training time. Which approach is better for your dataset size?
Level 3:
Create your own instruction dataset of 50+ examples for a specific task (code explanation, medical text classification, legal summarization). Fine-tune GPT-2 on it. Test the model with 10 new prompts it hasn't seen. Rate the responses 1-5 and report average quality.
References
- HuggingFace: Fine-tuning tutorial
- HuggingFace: TrainingArguments docs
- Stanford Alpaca: instruction fine-tuning
- HuggingFace: PEFT library (for LoRA)
Next up, Post 96: LoRA: Fine-Tune a Billion-Parameter Model on a Laptop. Parameter-efficient fine-tuning using rank decomposition. Train 1% of parameters and get 95% of the performance of full fine-tuning.




















