Stop Trusting Your Accuracy Score: A Practical Guide to Evaluating Logistic Regression Models
Gervais Yao · 2026-05-23 · via DEV Community

"Accuracy lied to you. Here's the complete toolkit—confusion matrix, precision, recall, F1, ROC/AUC, log loss, and cross-validation—that separates models that look good from models that actually work."

You trained your first classifier, ran .score(), and got 97% accuracy. You shipped it. Three weeks later, your fraud team tells you it's catching zero fraudulent transactions.

Sound familiar? You fell into the accuracy trap, and it's one of the most common mistakes among developers moving into ML.

This guide will give you the mental model and the code to evaluate binary classifiers properly. By the end, you'll know which metrics to reach for, when accuracy actively lies to you, how to read a ROC curve, and the seven pitfalls that silently kill production models.

Why Linear Regression Breaks for Classification

Before we get to evaluation, one minute on why we use logistic regression at all—because understanding the limitation it solves makes the evaluation choices clearer.

When you apply linear regression to a yes/no problem, you get predictions like 1.3 or -0.2. These aren't probabilities, so any threshold you put on them is arbitrary. Worse, a single extreme point in your training set can tilt the fitted line and move your decision boundary:

import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Binary labels: 0 or 1
X = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LinearRegression().fit(X, y)
print(model.predict([[0]]))   # ≈ -0.36, not a valid probability
print(model.predict([[10]]))  # ≈ 1.55, also not valid

[Figure: Linear regression performing poorly]

Logistic regression fixes this by wrapping the linear combination in a sigmoid function:

σ(z) = 1 / (1 + e^(-z))    where z = β₀ + β₁X₁ + ... + βₙXₙ

[Figure: Sigmoid function]

The sigmoid squashes any real number into the interval (0, 1), giving you an actual probability. It also models the log-odds of the positive class linearly, which is the statistician's way of saying "we get interpretable coefficients."

Under the hood, the model is fit by maximum likelihood estimation, which is equivalent to minimizing cross-entropy loss (not squared error). The decision boundary is linear (a straight line in a 2D feature space), but the output is a probability estimate rather than an unbounded score.
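To make the sigmoid and its link to log-odds concrete, here is a minimal standard-library sketch. The coefficients `beta_0` and `beta_1` are made-up values for illustration, not fitted to any data.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p: float) -> float:
    """Inverse of the sigmoid: the log-odds (logit) of probability p."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients for a one-feature model (illustration only).
beta_0, beta_1 = -4.5, 1.0

for x in (0, 4.5, 10):
    z = beta_0 + beta_1 * x          # linear combination
    p = sigmoid(z)                   # squashed into (0, 1)
    print(f"x={x:>4}: z={z:+.2f}, p={p:.4f}, log-odds={log_odds(p):+.2f}")
```

The log-odds column reproduces `z`, which is what "models the log-odds linearly" means in practice.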

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=10,
    n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Note: C is the inverse of regularization strength (C = 1/λ)
# Smaller C = stronger regularization = less overfitting
model = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Hard class predictions
y_pred = model.predict(X_test)

# Predicted probabilities of the positive class; use these for most evaluation tasks
y_prob = model.predict_proba(X_test)[:, 1]


Quick note on regularization: LogisticRegression in scikit-learn uses L2 regularization by default (penalty='l2'). Use penalty='l1' with solver='liblinear' if you want automatic feature selection via sparsity.

The Problem: Accuracy Actively Misleads You on Imbalanced Data

Here's the scenario that trips up almost everyone.

Fraud detection dataset:

  • 10,000 transactions
  • 9,900 legitimate (99%)
  • 100 fraudulent (1%)

Build a model that predicts "legitimate" for every single transaction:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0]*9900 + [1]*100)
y_dummy = np.zeros(10000)  # Predicts "not fraud" always

print(f"Dummy accuracy: {accuracy_score(y_true, y_dummy):.1%}")
# Output: Dummy accuracy: 99.0%


Your "model" achieves 99% accuracy and catches zero fraud cases.

This isn't a gotcha edge case—it's the normal situation in fraud detection, medical diagnosis, churn prediction, and anomaly detection. Whenever your classes are imbalanced, accuracy is nearly useless as a primary metric.

The root problem: accuracy treats all errors as equal. But missing a fraudulent transaction (false negative) is catastrophically different from flagging a legitimate one (false positive). You need metrics that distinguish between error types.
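One way to see why equal weighting hurts is to attach a cost to each error type. The sketch below compares the always-legitimate dummy model with a hypothetical alternative. The per-error costs and the alternative model's counts are invented for illustration.

```python
# Hypothetical per-error costs (assumptions, not from any real system).
COST_FALSE_NEGATIVE = 500  # a missed fraudulent transaction
COST_FALSE_POSITIVE = 5    # a legitimate transaction flagged for review

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def total_cost(fp, fn):
    """Total cost of the errors under the assumed per-error costs."""
    return fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

# Dummy model: predicts "legitimate" for all 10,000 transactions.
dummy = {"tp": 0, "tn": 9900, "fp": 0, "fn": 100}
# Hypothetical model that flags more transactions (made-up counts).
flagger = {"tp": 80, "tn": 9600, "fp": 300, "fn": 20}

for name, counts in (("dummy", dummy), ("flagger", flagger)):
    acc = accuracy(**counts)
    cost = total_cost(counts["fp"], counts["fn"])
    print(f"{name:>8}: accuracy={acc:.1%}, cost={cost}")
```

Under these assumed costs, the dummy model wins on accuracy yet costs more than four times as much as the model that actually flags fraud.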

The Confusion Matrix: Your Evaluation Foundation

Everything useful in binary classification evaluation flows from the confusion matrix—a 2×2 breakdown of where your predictions agree and disagree with reality.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap='Blues')
plt.title("Confusion Matrix")
plt.show()

[Figure: Confusion matrix]

scikit-learn convention (rows = actual, columns = predicted):

|                  | Predicted: Negative (0) | Predicted: Positive (1) |
|------------------|-------------------------|-------------------------|
| Actual: Negative | True Negative (TN) ✅    | False Positive (FP) ❌   |
| Actual: Positive | False Negative (FN) ❌   | True Positive (TP) ✅    |

Plain English:

  • True Positive (TP): You predicted fraud. It was fraud.
  • True Negative (TN): You predicted legit. It was legit.
  • False Positive (FP): You cried wolf. Customer was innocent. (Type I error)
  • False Negative (FN): You missed the fraudster. (Type II error)

⚠️ Heads up: Some textbooks and tools swap the axis convention. When reading someone else's confusion matrix, always check the axis labels before drawing conclusions.
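To show where the four cells come from, here is a small from-scratch count in plain Python. The labels are toy values invented for illustration, and `confusion_counts` is a hypothetical helper, not a scikit-learn function. It returns the cells in the same order you get from flattening a binary scikit-learn confusion matrix row by row.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:  # actual == 1 and predicted == 0
            fn += 1
    return tn, fp, fn, tp

# Toy labels, invented for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)

# Rows = actual, columns = predicted.
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)
```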

Precision, Recall, F1, and Specificity

Once you have the confusion matrix, every classification metric is just arithmetic on those four numbers.

Precision: "When I fire, do I hit?"

Precision = TP / (TP + FP)


Of all the positives you predicted, what fraction were actually positive? High precision means you rarely raise false alarms.

Reach for precision when false positives are expensive: spam filtering (you don't want to delete legitimate emails), content moderation (you don't want to wrongly remove posts).

Recall (Sensitivity): "Do I catch everything?"

Recall = TP / (TP + FN)


Of all the positives that actually exist, what fraction did you catch? High recall means you miss very few real positives.

Reach for recall when false negatives are dangerous: cancer screening (missing a tumor is catastrophic), fraud detection (missing fraud costs money), churn (missing a leaving customer means lost revenue).
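Both formulas are plain arithmetic on confusion-matrix counts, as this short sketch shows. The counts are hypothetical, and returning 0.0 when the denominator is zero is a convention chosen here for convenience, not the only reasonable one.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 if nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 if there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Hypothetical counts for a fraud model (made up for illustration).
tp, fp, fn = 80, 300, 20

print(f"precision = {precision(tp, fp):.3f}")  # 80 of 380 alerts were real fraud
print(f"recall    = {recall(tp, fn):.3f}")     # 80 of 100 fraud cases were caught
```

A model like this catches most fraud but raises many false alarms, which is exactly the tension the next section describes.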

The Unavoidable Trade-off

Lower your classification threshold → you predict positive more often → recall goes up, precision usually goes down. Raise it → fewer positive predictions → precision usually goes up, recall goes down. The two generally pull in opposite directions; there's no free lunch.

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)

plt.figure(figsize=(8, 5))
plt.plot(thresholds, precisions[:-1], label='Precision', color='blue')
plt.plot(thresholds, recalls[:-1], label='Recall', color='red')
plt.xlabel('Classification Threshold')
plt.ylabel('Score')
plt.title('Precision-Recall Trade-off vs Threshold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

[Figure: Precision-Recall trade-off vs threshold]

F1 Score: Balancing Both

F1 = 2 × (Precision × Recall) / (Precision + Recall)


The F1 score is the harmonic mean of precision and recall. Unlike the arithmetic mean, it punishes imbalance: a model with precision=1.0 and recall=0.0 gets an F1 of 0.0, not 0.5. Both have to be high to score well.

Use F1 when you need a single headline number and care roughly equally about precision and recall. It's especially useful for comparing models on imbalanced datasets.
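A quick comparison of the harmonic and arithmetic means makes the "punishes imbalance" point tangible. This is a minimal sketch; the precision/recall pairs are arbitrary illustrative values.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def arithmetic_mean(precision, recall):
    return (precision + recall) / 2

# Arbitrary illustrative pairs: lopsided, very lopsided, balanced.
for p, r in ((1.0, 0.0), (0.9, 0.1), (0.5, 0.5)):
    print(f"P={p:.1f} R={r:.1f}  mean={arithmetic_mean(p, r):.2f}  F1={f1(p, r):.2f}")
```

All three pairs share an arithmetic mean of 0.5, but only the balanced pair keeps an F1 of 0.5.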

Specificity (True Negative Rate): The Clinical Counterpart

Specificity = TN / (TN + FP)


The flip side of recall, but for negatives. "Of all actual negatives, how many did I correctly rule out?" Common in medical contexts:

  • High recall (sensitivity): Use for initial screening—catch every possible case.
  • High specificity: Use for confirmatory testing, to avoid false diagnoses.

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=['Legit', 'Fraud']))


🔑 Read classification_report carefully. The accuracy row at the bottom tells you almost nothing here. Look at the per-class precision, recall, and F1 for your minority class.

Choosing the Right Metric for Your Situation

Here's the decision framework I use before I even start training:

| Question                               | Answer | → Use                                        |
|----------------------------------------|--------|----------------------------------------------|
| Is the dataset imbalanced?             | Yes    | Precision / Recall / F1 / PR-AUC             |
|                                        | No     | Accuracy is acceptable as a secondary metric |
| FP costly, FN cheap?                   | Yes    | Optimize Precision                           |
| FN costly, FP cheap?                   | Yes    | Optimize Recall                              |
| Both costly?                           | Yes    | F1 or cost-weighted metric                   |
| Need threshold-independent comparison? | Yes    | AUC-ROC or AUC-PR                            |

For fraud, churn, and disease: optimize recall first, then set a precision floor your business can tolerate. For spam filters and recommendation engines: optimize precision, accept some misses.
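The "recall first, with a precision floor" rule can be sketched as a small search over candidate thresholds. Everything below is illustrative: the helper names are hypothetical, the labels and scores are toy values, and in practice you would run this on a validation set rather than the test set.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = fp = fn = 0
    for actual, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def max_recall_with_precision_floor(y_true, y_score, floor):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least `floor`, or None if none qualify."""
    best = None
    for threshold in sorted(set(y_score)):
        precision, recall = precision_recall_at(y_true, y_score, threshold)
        if precision >= floor and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best

# Toy validation labels and scores, invented for illustration.
y_val = [0, 0, 1, 0, 1, 0, 1, 1]
val_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

choice = max_recall_with_precision_floor(y_val, val_scores, floor=0.6)
print(choice)
```

Raising `floor` pushes the chosen threshold up and the achievable recall down, which is the business trade-off to negotiate.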

ROC Curve and AUC: Threshold-Independent Evaluation

All the metrics above assume a fixed decision threshold (typically 0.5). But the right threshold depends on your business context and changes as requirements evolve. How do you compare two models before you've even decided on a threshold?

Enter: the ROC (Receiver Operating Characteristic) curve.

ROC plots True Positive Rate (Recall) on the Y-axis against False Positive Rate on the X-axis, across every possible threshold. Each point on the curve is one threshold value.

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)

plt.figure(figsize=(7, 6))
plt.plot(fpr, tpr, color='steelblue', lw=2, label=f'ROC Curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', lw=1, label='Random Guessing (AUC = 0.5)')
plt.fill_between(fpr, tpr, alpha=0.15, color='steelblue')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Recall / Sensitivity)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

[Figure: ROC curve]

AUC: Reading the Number

AUC (Area Under the ROC Curve) condenses the entire curve into one number that tells you how well your model ranks positives above negatives.

| AUC Value | Interpretation                                                |
|-----------|---------------------------------------------------------------|
| 1.0       | Perfect: model always ranks positives above negatives         |
| 0.9 – 1.0 | Outstanding                                                   |
| 0.8 – 0.9 | Excellent                                                     |
| 0.7 – 0.8 | Acceptable                                                    |
| 0.5       | Random guessing: the model has no discriminative ability      |
| < 0.5     | Worse than random (flip predictions to get > 0.5)             |

The AUC has a beautiful probabilistic interpretation: it equals the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example by your model.
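That interpretation can be checked directly by counting pairs. The sketch below is a brute-force version for intuition only (it is quadratic in the number of examples); `pairwise_auc` is a hypothetical helper and the data is a toy example. Ties are counted as half a win, a common convention.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs in which the positive example
    gets the higher score; ties count as half a win."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy labels and scores, invented for illustration.
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

print(pairwise_auc(y_true, y_score))  # 13 of 16 pairs are ranked correctly
```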

When to Ditch ROC for the Precision-Recall Curve

ROC curves can be overly optimistic on severely imbalanced datasets. Why? FPR (False Positive Rate = FP / (FP + TN)) has the large TN count in the denominator. When there are thousands of true negatives, even many false positives produce a tiny FPR—making your ROC curve look good while your precision is terrible.
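A few lines of arithmetic show the effect. The counts below are hypothetical, chosen to resemble a dataset with 1% positives.

```python
# Hypothetical confusion-matrix counts for a 1%-positive dataset (made up).
tp, fn = 80, 20      # 100 actual positives
fp, tn = 300, 9600   # 9,900 actual negatives

tpr = tp / (tp + fn)        # recall, the ROC y-axis
fpr = fp / (fp + tn)        # the ROC x-axis
precision = tp / (tp + fp)  # what the ROC curve never shows

print(f"TPR       = {tpr:.3f}")
print(f"FPR       = {fpr:.3f}")
print(f"Precision = {precision:.3f}")
```

On a ROC plot this point sits near the top-left corner (high TPR, FPR around 3%), yet roughly four out of five alerts are false alarms.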

Rule of thumb:

  • Balanced classes → ROC/AUC is reliable.
  • Heavy class imbalance → Use the Precision-Recall curve and AUC-PR instead.

from sklearn.metrics import average_precision_score, PrecisionRecallDisplay

ap = average_precision_score(y_test, y_prob)
display = PrecisionRecallDisplay.from_predictions(
    y_test, y_prob, name=f"AP = {ap:.3f}"
)
display.ax_.set_title("Precision-Recall Curve")
plt.show()

[Figure: Precision-Recall curve]

[Figure: ROC curve vs Precision-Recall curve]

Log Loss: The Probabilistic Metric You Should Be Using More

Accuracy, precision, and recall all evaluate hard predictions (the 0/1 decision). But your model produces probabilities, and evaluating only the binary output throws away information.

Log loss (cross-entropy) measures how well-calibrated your probability estimates are:

Log Loss = -(1/n) × Σ [y_i × log(p_i) + (1 - y_i) × log(1 - p_i)]


In plain terms: assign a probability of 0.99 to the positive class for an example that turns out to be negative, and you're penalized harshly. For a true positive, predicting 0.60 instead of 0.51 earns a better log loss even though both produce the same hard prediction at a 0.5 threshold.
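The per-example penalty is easy to compute by hand. This minimal sketch evaluates one term of the log loss sum for a few illustrative predictions; `example_log_loss` is a hypothetical helper name.

```python
import math

def example_log_loss(y, p):
    """Log loss contribution of one example with true label y (0 or 1)
    and predicted probability p of the positive class."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

confident_wrong = example_log_loss(0, 0.99)  # said 99% positive, was negative
fairly_sure = example_log_loss(1, 0.60)      # said 60% positive, was positive
barely_sure = example_log_loss(1, 0.51)      # said 51% positive, was positive

print(f"confident and wrong: {confident_wrong:.3f}")
print(f"0.60 on a positive:  {fairly_sure:.3f}")
print(f"0.51 on a positive:  {barely_sure:.3f}")
```

The confident mistake costs several times more than either hesitant correct prediction, which is why overconfident models score badly on log loss.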

Log loss is preferred when:

  • Downstream systems consume probabilities, not labels (e.g., expected value calculations)
  • You're comparing two models that produce identical accuracy/F1 but different calibration
  • You're using the output to set a custom business threshold

from sklearn.metrics import log_loss

ll = log_loss(y_test, y_prob)
print(f"Log Loss: {ll:.4f}")
# Perfect model: 0.0
# Random guessing: ln(2) ≈ 0.693


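Lower log loss means better probability estimates. A model with log loss above 0.693 is doing worse than one that outputs 0.5 for every example. On imbalanced data the bar is stricter, because always predicting the base rate already scores well below 0.693.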
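To judge a log loss value, it helps to know what a constant prediction would score. The sketch below computes that baseline for a given class prevalence; the function name is hypothetical and the 1% prevalence mirrors the fraud example used earlier.

```python
import math

def constant_prediction_log_loss(prevalence, p):
    """Average log loss when every example is assigned probability p
    and a fraction `prevalence` of the examples are actually positive."""
    return -(prevalence * math.log(p) + (1 - prevalence) * math.log(1 - p))

balanced = constant_prediction_log_loss(0.5, 0.5)           # equals ln(2)
base_rate_1pct = constant_prediction_log_loss(0.01, 0.01)   # predict the base rate
coin_flip_1pct = constant_prediction_log_loss(0.01, 0.5)    # predict 0.5 anyway

print(f"balanced, p=0.5:       {balanced:.3f}")
print(f"1% positives, p=0.01:  {base_rate_1pct:.3f}")
print(f"1% positives, p=0.5:   {coin_flip_1pct:.3f}")
```

On the 1%-positive data, a useful model has to beat the base-rate baseline, not the 0.693 figure.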

Cross-Validation: Getting Evaluation You Can Trust

Single train/test splits are noisy. If you got lucky (or unlucky) with how data was randomly partitioned, your metrics don't generalize. Cross-validation gives you a reliable estimate.

k-Fold Cross-Validation

Split data into k folds. Train on k-1 folds, test on the remaining fold. Repeat k times (once per fold as the test set). Average the k results.
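The splitting procedure itself is simple enough to write out. This is a simplified sketch with contiguous, unshuffled folds; `k_fold_indices` is a hypothetical helper, not the scikit-learn implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds without shuffling.
    When n_samples is not divisible by k, the first folds get one extra sample."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {number}: test={test_idx} train={train_idx}")
```

Every sample lands in the test fold exactly once, so each one contributes to the averaged score.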

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Always put preprocessing inside the pipeline!
# This prevents data leakage from the scaler.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(C=1.0, max_iter=1000, random_state=42))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate multiple metrics
for metric in ['accuracy', 'f1', 'roc_auc', 'neg_log_loss']:
    scores = cross_val_score(pipe, X, y, cv=cv, scoring=metric)
    label = metric.replace('neg_', '-')
    print(f"{label:>15}: {scores.mean():.4f} ± {scores.std():.4f}")

Output:

       accuracy: 0.8920 ± 0.0145
             f1: 0.8911 ± 0.0163
        roc_auc: 0.9587 ± 0.0098
      -log_loss: -0.2734 ± 0.0121


Why Stratified, Not Regular k-Fold?

With imbalanced classes, a random split might put almost all the minority class examples in one fold—making some folds impossible to evaluate meaningfully.

StratifiedKFold preserves the class ratio in each fold. Use it by default for classification, especially with imbalanced data. It's almost always the right choice.
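The idea behind stratification can be sketched by dealing each class's samples into folds round-robin. This is a simplified illustration with no shuffling, not how scikit-learn implements `StratifiedKFold`; the function name and the 90/10 label split are assumptions.

```python
def stratified_fold_assignment(y, k):
    """Assign sample indices to k folds so that each class is spread
    as evenly as possible across the folds."""
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        members = [i for i, value in enumerate(y) if value == label]
        # Deal this class's samples to the folds one at a time.
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds

# Toy labels: 90 negatives, 10 positives (invented for illustration).
y = [0] * 90 + [1] * 10
folds = stratified_fold_assignment(y, k=5)

for number, fold in enumerate(folds, start=1):
    positives = sum(y[i] for i in fold)
    print(f"fold {number}: size={len(fold)}, positives={positives}")
```

Each fold keeps the 10% positive rate of the full dataset, so every fold can be scored on the minority class.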

[Figure: Cross-validation]

7 Pitfalls That Will Silently Break Your Evaluation

1. Reporting Only Accuracy

Symptom: Your model scores 97% accuracy and gets shipped. It catches nothing useful.

Fix: Always report precision, recall, F1, and AUC alongside accuracy. If classes are imbalanced, accuracy is a secondary metric at best.

2. Data Leakage in Preprocessing

Symptom: Suspiciously high validation metrics that don't hold up in production.

Cause: Fitting your scaler, imputer, or feature selector on the full dataset before splitting, letting test-set information influence your transforms.

# ❌ WRONG — scaler sees test data, leaks information
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)           # fit on everything
X_train, X_test = train_test_split(X_scaled) # split afterward

# ✅ CORRECT — use a pipeline or manually split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # transform test with train params


3. Using Default 0.5 Threshold Without Questioning It

Symptom: Good AUC, terrible precision or recall in production.

Fix: Choose the threshold on a validation set so that it reflects your business cost ratio, instead of assuming 0.5 is right.

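# Find the threshold that maximizes F1 on validation data.
# Assumes X_val and y_val come from a held-out validation split (as in
# pitfall 5) and that the model was fit on the matching X_train,
# so the test set stays untouched.
import numpy as np
from sklearn.metrics import f1_score

y_val_prob = model.predict_proba(X_val)[:, 1]

best_f1, best_threshold = 0.0, 0.5
for threshold in np.arange(0.1, 0.9, 0.01):
    y_pred_t = (y_val_prob >= threshold).astype(int)
    f = f1_score(y_val, y_pred_t)
    if f > best_f1:
        best_f1 = f
        best_threshold = threshold

print(f"Best threshold: {best_threshold:.2f}, F1: {best_f1:.4f}")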
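If you know the relative cost of each error type, you can pick the threshold by minimizing cost directly rather than maximizing F1. The following is a minimal sketch under stated assumptions: the 10:1 cost ratio, the helper name, and the toy validation data are all invented for illustration.

```python
# Hypothetical cost ratio: a missed positive is 10x worse than a false alarm.
COST_FALSE_NEGATIVE = 10
COST_FALSE_POSITIVE = 1

def cost_at(y_true, y_score, threshold):
    """Total error cost when scores >= threshold are predicted positive."""
    cost = 0
    for actual, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 0 and actual == 1:
            cost += COST_FALSE_NEGATIVE
        elif predicted == 1 and actual == 0:
            cost += COST_FALSE_POSITIVE
    return cost

# Toy validation labels and scores, invented for illustration.
y_val = [0, 0, 1, 0, 1, 0, 1, 1]
val_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

# Candidate thresholds: every distinct score seen on the validation set.
candidates = sorted(set(val_scores))
cheapest_threshold = min(candidates, key=lambda t: cost_at(y_val, val_scores, t))

print(cheapest_threshold, cost_at(y_val, val_scores, cheapest_threshold))
```

With false negatives weighted this heavily, the cheapest threshold lands well below 0.5, trading a couple of false alarms for zero missed positives.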


4. Ignoring Class Imbalance in CV

Symptom: Cross-validation folds have inconsistent class distributions; some folds fail or give wild metric swings.

Fix: Use StratifiedKFold (shown above). Also consider class_weight='balanced' in your model:

model = LogisticRegression(class_weight='balanced', max_iter=1000)


5. Evaluating on the Test Set More Than Once

Symptom: You iterate by checking test metrics, making changes, re-checking—and unknowingly over-fit to the test set.

Fix: Use a three-way split or cross-validation for development; touch the test set exactly once for final reporting.

# Three-way split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
# Train on X_train, tune on X_val, final eval on X_test (once!)


6. Over-Relying on AUC-ROC with Severe Imbalance

Symptom: ROC-AUC looks great; actual fraud/disease detection rate is awful.

Fix: Switch to AUC-PR (average_precision_score) for heavily imbalanced problems.

7. Skipping a Baseline Comparison

Symptom: You report 0.87 AUC with no context.

Fix: Always compare against a dummy baseline.

from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='most_frequent')
baseline_auc = cross_val_score(dummy, X, y, cv=5, scoring='roc_auc').mean()
model_auc = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc').mean()

print(f"Baseline AUC: {baseline_auc:.3f}")
print(f"Model AUC:    {model_auc:.3f}")
print(f"Improvement:  {model_auc - baseline_auc:+.3f}")


The Complete Evaluation Workflow

Here's what you should actually run before declaring a model ready:

import matplotlib.pyplot as plt
from sklearn.metrics import (
    classification_report, confusion_matrix, ConfusionMatrixDisplay,
    roc_auc_score, average_precision_score, log_loss, f1_score
)

def evaluate_classifier(model, X_test, y_test, threshold=0.5):
    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = (y_prob >= threshold).astype(int)

    print("=" * 50)
    print(f"EVALUATION REPORT (threshold = {threshold})")
    print("=" * 50)

    # 1. Classification report (precision, recall, F1 per class)
    print("\n--- Per-Class Metrics ---")
    print(classification_report(y_test, y_pred))

    # 2. Probabilistic metrics
    print(f"ROC-AUC:       {roc_auc_score(y_test, y_prob):.4f}")
    print(f"PR-AUC:        {average_precision_score(y_test, y_prob):.4f}")
    print(f"Log Loss:      {log_loss(y_test, y_prob):.4f}")

    # 3. Confusion matrix (print the raw counts, then plot them)
    print("\n--- Confusion Matrix ---")
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
    disp = ConfusionMatrixDisplay(cm)
    disp.plot(cmap='Blues')
    plt.title(f"Confusion Matrix (threshold={threshold})")
    plt.tight_layout()
    plt.show()

evaluate_classifier(model, X_test, y_test, threshold=0.5)


Practical Takeaways Checklist

Before you call any classifier production-ready:

  • [ ] Look at the confusion matrix first. Numbers before plots.
  • [ ] Report precision, recall, and F1 for the minority class, not just overall accuracy.
  • [ ] Use StratifiedKFold cross-validation to get reliable metric estimates.
  • [ ] Compare ROC-AUC and PR-AUC. If classes are imbalanced, PR-AUC is your primary signal.
  • [ ] Check log loss to verify your probabilities are well-calibrated, not just your hard predictions.
  • [ ] Question the 0.5 threshold. Tune it to match the real cost of FP vs. FN in your domain.
  • [ ] Use a Pipeline to prevent data leakage from preprocessing steps.
  • [ ] Run a DummyClassifier baseline before celebrating your AUC score.
  • [ ] Reserve your test set. If you've looked at it more than once during development, it's a validation set.
  • [ ] Tie your metric choice to a business outcome. "We want to catch 90% of churners while maintaining > 60% precision" beats "maximize F1."

Final Thought

Model evaluation isn't about finding the best model in the abstract—it's about finding the right model for your specific problem. A 95% accuracy model can be completely useless. An 80% accuracy model can save lives or prevent fraud, depending on where it's wrong.

The metrics are just tools. The judgment—knowing which errors your system can tolerate and which it can't—is what makes you a useful engineer, not just a code runner.

Go measure wisely.

Found this useful? I'd love to hear which pitfall stung you hardest—drop it in the comments.