"Accuracy lied to you. Here's the complete toolkit—confusion matrix, precision, recall, F1, ROC/AUC, log loss, and cross-validation—that separates models that look good from models that actually work."
You trained your first classifier, ran .score(), and got 97% accuracy. You shipped it. Three weeks later, your fraud team tells you it's catching zero fraudulent transactions.
Sound familiar? You fell into the accuracy trap, and it's one of the most common mistakes among developers moving into ML.
This guide will give you the mental model and the code to evaluate binary classifiers properly. By the end, you'll know which metrics to reach for, when accuracy actively lies to you, how to read a ROC curve, and the seven pitfalls that silently kill production models.
Why Linear Regression Breaks for Classification
Before we get to evaluation, one minute on why we use logistic regression at all—because understanding the limitation it solves makes the evaluation choices clearer.
When you apply linear regression to a yes/no problem, you get predictions like 1.3 or -0.2. These aren't probabilities, so there's no principled threshold to apply to them. And a single extreme outlier in your training set can tilt the fitted line enough to move your decision boundary:
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Binary labels: 0 or 1
X = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
model = LinearRegression().fit(X, y)
print(model.predict([[0]])) # ≈ -0.36, not a valid probability
print(model.predict([[10]])) # ≈ 1.55, also not valid
Logistic regression fixes this by wrapping the linear combination in a sigmoid function:
σ(z) = 1 / (1 + e^(-z)) where z = β₀ + β₁X₁ + ... + βₙXₙ
The sigmoid squashes any real number into the interval (0, 1), giving you an actual probability. It also models the log-odds of the positive class linearly, which is the statistician's way of saying "we get interpretable coefficients."
Under the hood, the model is fit by maximum likelihood estimation, which is equivalent to minimizing cross-entropy loss (not squared error). The decision boundary is linear (a straight line in 2D feature space), but the output is a probability, and usually a reasonably well-calibrated one.
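To make the sigmoid and the log-odds relationship concrete, here's a minimal sketch in plain Python. The coefficients `b0` and `b1` are made up for illustration; they don't come from any fitted model.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p: float) -> float:
    """Inverse of the sigmoid: the log of p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical one-feature model: z = b0 + b1 * x
b0, b1 = -4.5, 1.0

for x in [0, 4.5, 10]:
    z = b0 + b1 * x
    p = sigmoid(z)
    print(f"x={x:>4}: z={z:+.1f}  p={p:.4f}  log-odds={log_odds(p):+.1f}")
```

Every probability stays strictly between 0 and 1, and taking the log-odds of the output recovers the linear term `z`. That's what "linear in the log-odds" means in practice.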
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(
    n_samples=1000, n_features=10,
    n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Note: C is the inverse of regularization strength (C = 1/λ)
# Smaller C = stronger regularization = less overfitting
model = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)
# Hard class predictions
y_pred = model.predict(X_test)
# Calibrated probabilities — use these for most evaluation tasks
y_prob = model.predict_proba(X_test)[:, 1]
Quick note on regularization: LogisticRegression in scikit-learn uses L2 regularization by default (penalty='l2'). Use penalty='l1' with solver='liblinear' if you want automatic feature selection via sparsity.
The Problem: Accuracy Actively Misleads You on Imbalanced Data
Here's the scenario that trips up almost everyone.
Fraud detection dataset:
- 10,000 transactions
- 9,900 legitimate (99%)
- 100 fraudulent (1%)
Build a model that predicts "legitimate" for every single transaction:
import numpy as np
from sklearn.metrics import accuracy_score
y_true = np.array([0]*9900 + [1]*100)
y_dummy = np.zeros(10000) # Predicts "not fraud" always
print(f"Dummy accuracy: {accuracy_score(y_true, y_dummy):.1%}")
# Output: Dummy accuracy: 99.0%
Your "model" achieves 99% accuracy and catches zero fraud cases.
This isn't a gotcha edge case—it's the normal situation in fraud detection, medical diagnosis, churn prediction, and anomaly detection. Whenever your classes are imbalanced, accuracy is nearly useless as a primary metric.
The root problem: accuracy treats all errors as equal. But missing a fraudulent transaction (false negative) is catastrophically different from flagging a legitimate one (false positive). You need metrics that distinguish between error types.
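Here's a rough sketch of what "errors aren't equal" looks like once you attach costs. The dollar figures and the second model's error counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical per-error costs (assumptions for illustration only).
COST_FALSE_NEGATIVE = 500   # a missed fraudulent transaction
COST_FALSE_POSITIVE = 5     # a legitimate transaction flagged for review

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def total_cost(fp, fn):
    return fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

# Same 10,000 transactions as above: 9,900 legitimate, 100 fraudulent.
always_legit = dict(tp=0, tn=9900, fp=0, fn=100)
# Hypothetical second model: catches 90 frauds but raises 490 false alarms.
noisy_model = dict(tp=90, tn=9410, fp=490, fn=10)

for name, c in [("always legit", always_legit), ("noisy model", noisy_model)]:
    print(f"{name:>12}: accuracy={accuracy(**c):.1%}  "
          f"cost={total_cost(c['fp'], c['fn']):,}")
```

Under these assumed costs, the model with the lower accuracy (95.0% vs. 99.0%) is far cheaper to run (7,450 vs. 50,000). Accuracy alone can't tell you that.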
The Confusion Matrix: Your Evaluation Foundation
Everything useful in binary classification evaluation flows from the confusion matrix—a 2×2 breakdown of where your predictions agree and disagree with reality.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap='Blues')
plt.title("Confusion Matrix")
plt.show()
scikit-learn convention (rows = actual, columns = predicted):
| | Predicted: Negative (0) | Predicted: Positive (1) |
|---|---|---|
| Actual: Negative | True Negative (TN) ✅ | False Positive (FP) ❌ |
| Actual: Positive | False Negative (FN) ❌ | True Positive (TP) ✅ |
Plain English:
- True Positive (TP): You predicted fraud. It was fraud.
- True Negative (TN): You predicted legit. It was legit.
- False Positive (FP): You cried wolf. Customer was innocent. (Type I error)
- False Negative (FN): You missed the fraudster. (Type II error)
⚠️ Heads up: Some textbooks and tools swap the axis convention. When reading someone else's confusion matrix, always check the axis labels before drawing conclusions.
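If the layout still feels abstract, it can help to count the four cells by hand once. This is a small sketch with made-up labels; the function name `confusion_counts` is mine, and it mirrors the rows-are-actual, columns-are-predicted convention described above.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows = actual, columns = predicted."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Toy labels (1 = fraud, 0 = legit), made up for illustration.
actual_labels    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

(tn, fp), (fn, tp) = confusion_counts(actual_labels, predicted_labels)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=4 FP=2 FN=1 TP=3
```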
Precision, Recall, F1, and Specificity
Once you have the confusion matrix, every classification metric is just arithmetic on those four numbers.
Precision: "When I fire, do I hit?"
Precision = TP / (TP + FP)
Of all the positives you predicted, what fraction were actually positive? High precision means you rarely raise false alarms.
Reach for precision when false positives are expensive: spam filtering (you don't want to delete legitimate emails), content moderation (you don't want to wrongly remove posts).
Recall (Sensitivity): "Do I catch everything?"
Recall = TP / (TP + FN)
Of all the positives that actually exist, what fraction did you catch? High recall means you miss very few real positives.
Reach for recall when false negatives are dangerous: cancer screening (missing a tumor is catastrophic), fraud detection (missing fraud costs money), churn (missing a leaving customer means lost revenue).
The Unavoidable Trade-off
Lower your classification threshold → you predict positive more often → recall goes up, precision goes down. Raise it → fewer positive predictions → precision goes up, recall goes down. They move in opposite directions; there's no free lunch.
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)
plt.figure(figsize=(8, 5))
plt.plot(thresholds, precisions[:-1], label='Precision', color='blue')
plt.plot(thresholds, recalls[:-1], label='Recall', color='red')
plt.xlabel('Classification Threshold')
plt.ylabel('Score')
plt.title('Precision-Recall Trade-off vs Threshold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
F1 Score: Balancing Both
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The F1 score is the harmonic mean of precision and recall. Unlike the arithmetic mean, it punishes imbalance: a model with precision=1.0 and recall=0.0 gets an F1 of 0.0, not 0.5. Both have to be high to score well.
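A quick side-by-side makes the difference between the two means visible. This sketch uses a hand-written `f1` helper and arbitrary precision/recall pairs, purely for illustration.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

for p, r in [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5), (0.8, 0.7)]:
    arithmetic = (p + r) / 2
    print(f"P={p:.1f} R={r:.1f}  arithmetic={arithmetic:.2f}  F1={f1(p, r):.2f}")
```

The first three pairs all have an arithmetic mean of 0.50, but their F1 scores are 0.00, 0.18, and 0.50. The more lopsided the pair, the harder F1 pulls toward the smaller value.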
Use F1 when you need a single headline number and care roughly equally about precision and recall. It's especially useful for comparing models on imbalanced datasets.
Specificity (True Negative Rate): The Clinical Counterpart
Specificity = TN / (TN + FP)
The flip side of recall, but for negatives. "Of all actual negatives, how many did I correctly rule out?" Common in medical contexts:
- High recall (sensitivity): Use for initial screening—catch every possible case.
- High specificity: Use for confirmatory testing—avoid false diagnoses.
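As a rough illustration of that screening vs. confirmatory split, here are two hypothetical tests evaluated on the same 1,000 patients. All counts are invented for the example.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Recall / true positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical results on 1,000 patients, 50 of whom have the condition.
screening = dict(tp=49, fn=1, tn=760, fp=190)     # permissive threshold
confirmatory = dict(tp=40, fn=10, tn=945, fp=5)   # strict threshold

for name, c in [("screening", screening), ("confirmatory", confirmatory)]:
    print(f"{name:>12}: sensitivity={sensitivity(c['tp'], c['fn']):.2f}  "
          f"specificity={specificity(c['tn'], c['fp']):.3f}")
```

The screening test misses almost nobody but sends many healthy patients for follow-up; the confirmatory test rarely mislabels a healthy patient but would miss more cases if used alone.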
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=['Legit', 'Fraud']))
🔑 Read classification_report carefully. The accuracy row at the bottom tells you almost nothing here. Look at the per-class precision, recall, and F1 for your minority class.
Choosing the Right Metric for Your Situation
Here's the decision framework I use before I even start training:
| Question | Answer | → Use |
|---|---|---|
| Is the dataset imbalanced? | Yes | Precision / Recall / F1 / PR-AUC |
| Is the dataset imbalanced? | No | Accuracy is acceptable as a secondary metric |
| FP costly, FN cheap? | Yes | Optimize Precision |
| FN costly, FP cheap? | Yes | Optimize Recall |
| Both costly? | Yes | F1 or cost-weighted metric |
| Need threshold-independent comparison? | Yes | AUC-ROC or AUC-PR |
For fraud, churn, and disease: optimize recall first, then set a precision floor your business can tolerate. For spam filters and recommendation engines: optimize precision, accept some misses.
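One way to turn "optimize recall, subject to a precision floor" into code is sketched below. It's a from-scratch illustration on made-up validation scores; the helper names `precision_recall_at` and `best_threshold` are hypothetical, not scikit-learn functions.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting positive for prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def best_threshold(y_true, y_prob, precision_floor):
    """Return (recall, threshold) with the highest recall whose precision
    still meets the floor, or None if no threshold qualifies."""
    best = None
    for t in sorted(set(y_prob)):
        precision, recall = precision_recall_at(y_true, y_prob, t)
        if precision >= precision_floor and (best is None or recall > best[0]):
            best = (recall, t)
    return best

# Toy validation scores, made up for illustration.
val_labels = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]
val_scores = [0.05, 0.15, 0.30, 0.35, 0.45, 0.55, 0.60, 0.75, 0.80, 0.95]

print(best_threshold(val_labels, val_scores, precision_floor=0.6))
print(best_threshold(val_labels, val_scores, precision_floor=0.7))
```

On this toy data, raising the floor from 0.6 to 0.7 moves the chosen threshold from 0.30 to 0.60 and drops recall from 1.0 to 0.6. That's the trade you're negotiating with the business.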
ROC Curve and AUC: Threshold-Independent Evaluation
All the metrics above assume a fixed decision threshold (typically 0.5). But the right threshold depends on your business context and changes as requirements evolve. How do you compare two models before you've even decided on a threshold?
Enter: the ROC (Receiver Operating Characteristic) curve.
ROC plots True Positive Rate (Recall) on the Y-axis against False Positive Rate on the X-axis, across every possible threshold. Each point on the curve is one threshold value.
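To see where those points come from, here's a minimal from-scratch sketch that sweeps the threshold over four made-up scores. It's meant to show the mechanics only; in practice you'd use the library routine.

```python
def roc_points(y_true, y_prob):
    """(FPR, TPR) at each distinct score used as the threshold, plus (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for t in sorted(set(y_prob), reverse=True):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

labels = [0, 0, 1, 1]            # toy data, made up for illustration
scores = [0.1, 0.4, 0.35, 0.8]

for fpr, tpr in roc_points(labels, scores):
    print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")
```

As the threshold drops, the curve can only move up (another true positive) or right (another false positive), which is why it always runs from (0, 0) to (1, 1).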
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)
plt.figure(figsize=(7, 6))
plt.plot(fpr, tpr, color='steelblue', lw=2, label=f'ROC Curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', lw=1, label='Random Guessing (AUC = 0.5)')
plt.fill_between(fpr, tpr, alpha=0.15, color='steelblue')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Recall / Sensitivity)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
AUC: Reading the Number
AUC (Area Under the ROC Curve) condenses the entire curve into one number that tells you how well your model ranks positives above negatives.
| AUC Value | Interpretation |
|---|---|
| 1.0 | Perfect — model always ranks positives above negatives |
| 0.9 – 1.0 | Outstanding |
| 0.8 – 0.9 | Excellent |
| 0.7 – 0.8 | Acceptable |
| 0.5 | Random guessing — the model has no discriminative ability |
| < 0.5 | Worse than random (flip predictions to get > 0.5) |
The AUC has a beautiful probabilistic interpretation: it equals the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example by your model.
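You can check that interpretation directly by comparing every positive/negative pair. This is a brute-force sketch on four made-up scores; the helper name `pairwise_auc` is mine, and the approach is far too slow for real datasets.

```python
def pairwise_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0)
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]            # toy data, made up for illustration
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly (the positive scored 0.35 loses to the negative scored 0.4), so the AUC is 0.75.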
When to Ditch ROC for the Precision-Recall Curve
ROC curves can be overly optimistic on severely imbalanced datasets. Why? FPR (False Positive Rate = FP / (FP + TN)) has the large TN count in the denominator. When there are thousands of true negatives, even many false positives produce a tiny FPR—making your ROC curve look good while your precision is terrible.
Rule of thumb:
- Balanced classes → ROC/AUC is reliable.
- Heavy class imbalance → Use the Precision-Recall curve and AUC-PR instead.
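The arithmetic behind that rule of thumb fits in a few lines. The confusion counts below are hypothetical, picked to resemble a 1%-positive problem.

```python
# Hypothetical confusion counts for a 1%-positive dataset (10,000 rows).
tp, fn = 80, 20        # 100 actual positives
fp, tn = 400, 9500     # 9,900 actual negatives

tpr = tp / (tp + fn)          # recall, the ROC y-axis
fpr = fp / (fp + tn)          # the ROC x-axis
precision = tp / (tp + fp)    # the PR-curve y-axis

print(f"TPR={tpr:.2f}  FPR={fpr:.3f}  precision={precision:.3f}")
```

On a ROC plot this operating point looks great (TPR 0.80 at FPR 0.040), yet only about one alert in six is real (precision 0.167). The PR curve shows that problem; the ROC curve hides it.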
from sklearn.metrics import average_precision_score, PrecisionRecallDisplay
ap = average_precision_score(y_test, y_prob)
display = PrecisionRecallDisplay.from_predictions(
y_test, y_prob, name=f"AP = {ap:.3f}"
)
display.ax_.set_title("Precision-Recall Curve")
plt.show()
Log Loss: The Probabilistic Metric You Should Be Using More
Accuracy, precision, and recall all evaluate hard predictions (the 0/1 decision). But your model produces probabilities, and evaluating only the binary output throws away information.
Log loss (cross-entropy) scores the probabilities themselves: it rewards estimates that are confident and right, and punishes ones that are confident and wrong.
Log Loss = -(1/n) × Σ [y_i × log(p_i) + (1 - y_i) × log(1 - p_i)]
In plain terms: assign 0.99 to an example that turns out to be negative, and you're penalized harshly. For an example that really is positive, predicting 0.60 instead of 0.51 earns a better log loss even though both produce the same hard prediction at a 0.5 threshold.
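Those penalties are easy to work out per example. The sketch below applies the formula to a single prediction at a time; `example_log_loss` is a hand-written helper for illustration, not a library function.

```python
import math

def example_log_loss(y: int, p: float) -> float:
    """Log loss of one example with true label y and predicted P(y=1) = p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# True label is positive (y = 1): more confidence in the right answer costs less.
for p in [0.51, 0.60, 0.99]:
    print(f"y=1, p={p:.2f}: loss={example_log_loss(1, p):.3f}")

# True label is negative (y = 0): confident and wrong is punished hard.
print(f"y=0, p=0.99: loss={example_log_loss(0, 0.99):.3f}")
```

For the positive example the loss falls from about 0.673 to 0.511 to 0.010 as confidence rises, while the single confident mistake costs about 4.605. A few of those can dominate the average.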
Log loss is preferred when:
- Downstream systems consume probabilities, not labels (e.g., expected value calculations)
- You're comparing two models that produce identical accuracy/F1 but different calibration
- You're using the output to set a custom business threshold
from sklearn.metrics import log_loss
ll = log_loss(y_test, y_prob)
print(f"Log Loss: {ll:.4f}")
# Perfect model: 0.0
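# Always predicting 0.5: ln(2) ≈ 0.693
Lower log loss means better probability estimates. On balanced classes, a log loss above 0.693 means the model does worse than predicting 0.5 for everything; on imbalanced data the bar is lower, because always predicting the base rate already scores well below 0.693.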
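To see where the 0.693 reference point comes from and how it shifts with class balance, here's a small sketch that scores two constant predictors. The helper name `constant_prediction_log_loss` is hypothetical.

```python
import math

def constant_prediction_log_loss(base_rate: float, p: float) -> float:
    """Expected log loss when every example gets probability p
    and a fraction base_rate of the examples are positive."""
    return -(base_rate * math.log(p) + (1 - base_rate) * math.log(1 - p))

for base_rate in [0.5, 0.1, 0.01]:
    at_half = constant_prediction_log_loss(base_rate, 0.5)
    at_base = constant_prediction_log_loss(base_rate, base_rate)
    print(f"positives={base_rate:.0%}: predict 0.5 -> {at_half:.3f}, "
          f"predict base rate -> {at_base:.3f}")
```

Predicting 0.5 everywhere always costs ln(2), but with 1% positives a constant prediction of 0.01 already scores about 0.056. On imbalanced data, that base-rate figure is the number your model has to beat.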
Cross-Validation: Getting Evaluation You Can Trust
Single train/test splits are noisy. If you got lucky (or unlucky) with how data was randomly partitioned, your metrics don't generalize. Cross-validation gives you a reliable estimate.
k-Fold Cross-Validation
Split data into k folds. Train on k-1 folds, test on the remaining fold. Repeat k times (once per fold as the test set). Average the k results.
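The bookkeeping is simpler than it sounds. Here's a bare-bones sketch that generates the index splits for contiguous folds with no shuffling; `k_fold_indices` is a hand-rolled helper for illustration, not the scikit-learn splitter.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

for fold, (train, test) in enumerate(k_fold_indices(n_samples=10, k=5), start=1):
    print(f"fold {fold}: test={test}  train size={len(train)}")
```

Every sample lands in the test set exactly once and in the training set k-1 times, which is what makes the averaged score less dependent on one lucky split.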
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Always put preprocessing inside the pipeline!
# This prevents data leakage from the scaler.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(C=1.0, max_iter=1000, random_state=42))
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Evaluate multiple metrics
for metric in ['accuracy', 'f1', 'roc_auc', 'neg_log_loss']:
    scores = cross_val_score(pipe, X, y, cv=cv, scoring=metric)
    label = metric.replace('neg_', '-')
    print(f"{label:>15}: {scores.mean():.4f} ± {scores.std():.4f}")
accuracy: 0.8920 ± 0.0145
f1: 0.8911 ± 0.0163
roc_auc: 0.9587 ± 0.0098
-log_loss: -0.2734 ± 0.0121
Why Stratified, Not Regular k-Fold?
With imbalanced classes, a random split might put almost all the minority class examples in one fold—making some folds impossible to evaluate meaningfully.
StratifiedKFold preserves the class ratio in each fold. Use it by default for classification, especially with imbalanced data. It's almost always the right choice.
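Here's a simplified sketch of the idea: deal each class's samples round-robin across the folds. It illustrates the principle only and is not how scikit-learn implements it (there's no shuffling here, and `stratified_folds` is a made-up helper).

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, dealing each class round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Toy labels: 20 samples, 4 of them positive (20%), all positives at the end.
labels = [0] * 16 + [1] * 4

for fold in stratified_folds(labels, k=4):
    positives = sum(labels[i] for i in fold)
    print(f"fold size={len(fold)}  positives={positives}")
```

Each fold ends up with five samples and exactly one positive, matching the overall 20% rate. A plain contiguous four-way split of the same list would put all four positives in the last fold.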
7 Pitfalls That Will Silently Break Your Evaluation
1. Reporting Only Accuracy
Symptom: Your model scores 97% accuracy and gets shipped. It catches nothing useful.
Fix: Always report precision, recall, F1, and AUC alongside accuracy. If classes are imbalanced, accuracy is a secondary metric at best.
2. Data Leakage in Preprocessing
Symptom: Suspiciously high validation metrics that don't hold up in production.
Cause: Fitting your scaler, imputer, or feature selector on the full dataset before splitting, letting test-set information influence your transforms.
# ❌ WRONG — scaler sees test data, leaks information
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # fit on everything
X_train, X_test = train_test_split(X_scaled) # split afterward
# ✅ CORRECT — use a pipeline or manually split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit on train only
X_test_scaled = scaler.transform(X_test) # transform test with train params
3. Using Default 0.5 Threshold Without Questioning It
Symptom: Good AUC, terrible precision or recall in production.
Fix: Find the threshold that matches your business cost ratio, then tune it on a validation set.
# Find the threshold that maximizes F1 on validation data, not the test set.
# Assumes X_val / y_val come from a held-out validation split.
from sklearn.metrics import f1_score
y_val_prob = model.predict_proba(X_val)[:, 1]
best_f1, best_threshold = 0, 0.5
for threshold in np.arange(0.1, 0.9, 0.01):
    y_pred_t = (y_val_prob >= threshold).astype(int)
    f = f1_score(y_val, y_pred_t)
    if f > best_f1:
        best_f1 = f
        best_threshold = threshold
print(f"Best threshold: {best_threshold:.2f}, F1: {best_f1:.4f}")
4. Ignoring Class Imbalance in CV
Symptom: Cross-validation folds have inconsistent class distributions; some folds fail or give wild metric swings.
Fix: Use StratifiedKFold (shown above). Also consider class_weight='balanced' in your model:
model = LogisticRegression(class_weight='balanced', max_iter=1000)
5. Evaluating on the Test Set More Than Once
Symptom: You iterate by checking test metrics, making changes, re-checking—and unknowingly over-fit to the test set.
Fix: Use a three-way split or cross-validation for development; touch the test set exactly once for final reporting.
# Three-way split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
# Train on X_train, tune on X_val, final eval on X_test (once!)
6. Over-Relying on AUC-ROC with Severe Imbalance
Symptom: ROC-AUC looks great; actual fraud/disease detection rate is awful.
Fix: Switch to AUC-PR (average_precision_score) for heavily imbalanced problems.
7. Skipping a Baseline Comparison
Symptom: You report 0.87 AUC with no context.
Fix: Always compare against a dummy baseline.
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='most_frequent')
baseline_auc = cross_val_score(dummy, X, y, cv=5, scoring='roc_auc').mean()
model_auc = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc').mean()
print(f"Baseline AUC: {baseline_auc:.3f}")
print(f"Model AUC: {model_auc:.3f}")
print(f"Improvement: {model_auc - baseline_auc:+.3f}")
The Complete Evaluation Workflow
Here's what you should actually run before declaring a model ready:
import matplotlib.pyplot as plt
from sklearn.metrics import (
    classification_report, confusion_matrix, ConfusionMatrixDisplay,
    roc_auc_score, average_precision_score, log_loss, f1_score
)
def evaluate_classifier(model, X_test, y_test, threshold=0.5):
    # Score with probabilities, then apply the chosen decision threshold
    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = (y_prob >= threshold).astype(int)
    print("=" * 50)
    print(f"EVALUATION REPORT (threshold = {threshold})")
    print("=" * 50)
    # 1. Classification report (precision, recall, F1 per class)
    print("\n--- Per-Class Metrics ---")
    print(classification_report(y_test, y_pred))
    # 2. Probabilistic metrics
    print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
    print(f"PR-AUC: {average_precision_score(y_test, y_prob):.4f}")
    print(f"Log Loss: {log_loss(y_test, y_prob):.4f}")
    # 3. Confusion matrix
    print("\n--- Confusion Matrix ---")
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(cm)
    disp.plot(cmap='Blues')
    plt.title(f"Confusion Matrix (threshold={threshold})")
    plt.tight_layout()
    plt.show()
evaluate_classifier(model, X_test, y_test, threshold=0.5)
Practical Takeaways Checklist
Before you call any classifier production-ready:
- [ ] Look at the confusion matrix first. Numbers before plots.
- [ ] Report precision, recall, and F1 for the minority class, not just overall accuracy.
- [ ] Use StratifiedKFold cross-validation to get reliable metric estimates.
- [ ] Compare ROC-AUC and PR-AUC. If classes are imbalanced, PR-AUC is your primary signal.
- [ ] Check log loss to verify your probabilities are well-calibrated, not just your hard predictions.
- [ ] Question the 0.5 threshold. Tune it to match the real cost of FP vs. FN in your domain.
- [ ] Use a Pipeline to prevent data leakage from preprocessing steps.
- [ ] Run a DummyClassifier baseline before celebrating your AUC score.
- [ ] Reserve your test set. If you've looked at it more than once during development, it's a validation set.
- [ ] Tie your metric choice to a business outcome. "We want to catch 90% of churners while maintaining > 60% precision" beats "maximize F1."
Final Thought
Model evaluation isn't about finding the best model in the abstract—it's about finding the right model for your specific problem. A 95% accuracy model can be completely useless. An 80% accuracy model can save lives or prevent fraud, depending on where it's wrong.
The metrics are just tools. The judgment—knowing which errors your system can tolerate and which it can't—is what makes you a useful engineer, not just a code runner.
Go measure wisely.
Found this useful? I'd love to hear which pitfall stung you hardest—drop it in the comments.































