XGBoost: When Gradient Boosting Meets Regularization
jacobjerryar · 2026-05-16 · via DEV Community

1. The Problem It Solves

Imagine you’re a loan officer at a bank. You have thousands of past loan applications with features like income, credit score, employment length, and debt-to-income ratio. You need to predict whether a new applicant will default or repay. This is a binary classification problem, but real-world data is messy: missing values, outliers, non-linear relationships, and interactions between features. Many algorithms struggle to handle all of this gracefully without heavy preprocessing. XGBoost (eXtreme Gradient Boosting) was built specifically to solve such tabular prediction problems with high accuracy, speed, and robustness. It’s become the go‑to algorithm for Kaggle competitions and many industry applications, from fraud detection to customer churn prediction.

2. The Core Idea (Intuition First)

Think of a group of friends trying to guess the weight of a cake. The first friend makes a rough guess, say 2 kg. The second friend doesn’t start from scratch; instead, she tries to correct the error of the first guess. If the real weight is 2.5 kg, the error is +0.5 kg, so she predicts +0.5 kg. The third friend corrects whatever error remains, and so on. By combining many weak guesses (each only slightly better than random), they arrive at a very accurate final estimate.

XGBoost works exactly like that: it builds an ensemble of decision trees sequentially. Each new tree tries to correct the mistakes made by all previous trees combined. But there’s a twist – XGBoost adds regularization to prevent overfitting, and it optimises the whole process to be lightning fast. It’s not just “gradient boosting” – it’s gradient boosting on steroids.
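To make the "each friend corrects the remaining error" idea concrete, here is a minimal sketch of plain boosting on residuals, using only NumPy and one-split trees (stumps). It is not XGBoost itself: there is no regularisation and no second-order information, and the helper names (`fit_stump`, `predict_stump`, `boost`) and the toy sine data are assumptions made for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold on a 1-D feature that best fits the residual (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # drop the max so the right side is never empty
        left = x <= threshold
        left_value, right_value = residual[left].mean(), residual[~left].mean()
        sse = ((residual[left] - left_value) ** 2).sum() + ((residual[~left] - right_value) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.3):
    prediction = np.full_like(y, y.mean(), dtype=float)  # the first rough guess
    errors = []
    for _ in range(n_rounds):
        residual = y - prediction                 # what the ensemble still gets wrong
        stump = fit_stump(x, residual)            # the next "friend" models that error
        prediction += learning_rate * predict_stump(stump, x)
        errors.append(float(np.mean((y - prediction) ** 2)))
    return prediction, errors

# Toy data (hypothetical): a noisy sine wave
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

prediction, errors = boost(x, y)
print(f"Training MSE after 1 round: {errors[0]:.4f}, after 50 rounds: {errors[-1]:.4f}")
```

The training error shrinks round after round because every stump is fitted to what the previous ones left unexplained. The learning rate plays the same role as `learning_rate` in XGBoost: it scales each correction down so that no single tree dominates.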

Technically, XGBoost minimises a regularised objective function that balances prediction error (loss) with model complexity. It uses a second‑order Taylor approximation of the loss (like Newton’s method) to guide tree splitting, which is more accurate than the simple gradient used in standard gradient boosting.


3. How It Works (The Math + Logic)

XGBoost builds an ensemble of $K$ decision trees. For a given prediction $\hat{y}_i$, it sums the outputs of all trees:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$

where each $f_k$ is a tree (a mapping from features to leaf weights). The algorithm learns the trees one by one to minimise the following objective:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} \ell\bigl(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t)$$

  • $\ell$ is a differentiable loss function (e.g., log loss for classification, squared error for regression).
  • $\hat{y}_i^{(t-1)}$ is the prediction from the previous $t-1$ trees.
  • $f_t$ is the new tree we are adding at step $t$.
  • $\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$ is the regularisation term: $T$ = number of leaves in the tree, $w_j$ = weight (prediction) on leaf $j$, $\gamma$ and $\lambda$ are hyperparameters. This penalises complex trees (many leaves or large leaf weights), reducing overfitting.

XGBoost uses a second‑order approximation of the loss (Newton's method) to make optimisation efficient. For a given tree structure, the optimal leaf weight and the resulting gain from a split are derived analytically. When deciding where to split a node, XGBoost tries every feature and every possible split value, computing the "gain":

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L+G_R)^2}{H_L+H_R + \lambda} \right] - \gamma$$

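Here $G$ = sum of first derivatives (gradients) in a leaf, $H$ = sum of second derivatives (Hessians), and the subscripts $L$ and $R$ refer to the left and right child of the candidate split. Because $\gamma$ is already subtracted in the formula, a split is kept only if the gain is positive, i.e. the loss reduction (the bracketed term times $\frac{1}{2}$) exceeds $\gamma$. This is how $\gamma$ directly prunes leaves.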
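The gain formula is easier to trust once you plug numbers into it. The sketch below computes gradients and Hessians for log loss, evaluates two candidate splits on a six-row toy feature, and also uses the standard optimal leaf weight $w^* = -G / (H + \lambda)$ that comes out of the same second-order derivation. The function names and the toy data are assumptions for illustration; XGBoost's real implementation is far more optimised.

```python
import numpy as np

def grad_hess_logloss(y_true, raw_score):
    """First and second derivatives of log loss w.r.t. the raw (log-odds) score."""
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return p - y_true, p * (1.0 - p)

def leaf_weight(G, H, lam=1.0):
    """Optimal leaf weight for a leaf with gradient sum G and Hessian sum H."""
    return -G / (H + lam)

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Gain of sending rows in left_mask to the left child and the rest to the right."""
    def score(G, H):
        return G ** 2 / (H + lam)
    G_L, H_L = g[left_mask].sum(), h[left_mask].sum()
    G_R, H_R = g[~left_mask].sum(), h[~left_mask].sum()
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

# Toy data (hypothetical): the label flips between feature values 3 and 4
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
raw_score = np.zeros_like(y)  # before any tree is built, every row has p = 0.5

g, h = grad_hess_logloss(y, raw_score)
good_gain = split_gain(g, h, feature <= 3.0)  # separates the classes cleanly
bad_gain = split_gain(g, h, feature <= 1.0)   # isolates a single row
left_w = leaf_weight(g[feature <= 3.0].sum(), h[feature <= 3.0].sum())

print(f"gain at x <= 3: {good_gain:.4f}")
print(f"gain at x <= 1: {bad_gain:.4f}")
print(f"left leaf weight at x <= 3: {left_w:.4f}")
```

The clean split scores a gain of roughly 1.29 against roughly 0.16 for the lopsided one, so the tree picks the threshold at 3. Raising `gamma` above a split's gain makes it negative, which is exactly the pruning behaviour described above.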

The algorithm also includes:

  • Column subsampling (like Random Forest) – reduces overfitting and speeds up training.
  • Handling missing values – learns the best direction to send missing values.
  • Weighted quantile sketches – efficiently finds approximate split points for large datasets.
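The missing-value bullet deserves a closer look. A simplified way to picture it: for each candidate split, evaluate the gain twice, once with the missing rows sent left and once with them sent right, and keep the better direction as the default. The sketch below does this for a single threshold with NumPy. The function name `best_default_direction` and the toy data are assumptions, and the real sparsity-aware algorithm scans all thresholds far more efficiently.

```python
import numpy as np

def score(G, H, lam):
    return G ** 2 / (H + lam)

def best_default_direction(x, g, h, threshold, lam=1.0, gamma=0.0):
    """For one threshold, compare the gain of sending missing rows left vs right."""
    missing = np.isnan(x)
    goes_left = np.zeros_like(missing)               # boolean mask, all False
    goes_left[~missing] = x[~missing] <= threshold   # only compare observed values
    parent = score(g.sum(), h.sum(), lam)
    gains = {}
    for direction in ("left", "right"):
        left = (goes_left | missing) if direction == "left" else goes_left
        gains[direction] = 0.5 * (
            score(g[left].sum(), h[left].sum(), lam)
            + score(g[~left].sum(), h[~left].sum(), lam)
            - parent
        ) - gamma
    return max(gains, key=gains.get), gains

# Toy data (hypothetical): the two rows with a missing feature are both positives
x = np.array([1.0, 2.0, 5.0, 6.0, np.nan, np.nan])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
p = np.full_like(y, 0.5)       # initial predicted probability for every row
g, h = p - y, p * (1.0 - p)    # log-loss gradients and Hessians

direction, gains = best_default_direction(x, g, h, threshold=3.0)
print(direction, gains)
```

Here the missing rows look like the high-value positives, so sending them right gives the larger gain and "right" becomes the default direction for that node. At prediction time, any row with a missing value for this feature simply follows the learned default.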

After building $K$ trees, you have a powerful, regularised ensemble.

4. When to Use It

Best for:

  • Medium‑sized to large tabular datasets (thousands to millions of rows, dozens to hundreds of features).
  • Problems where you need high accuracy without extensive feature engineering – XGBoost can learn non‑linear interactions and handle mixed data types (numeric + categorical, though categorical needs encoding).
  • Situations where interpretability is secondary to performance (you can get feature importance, but a single tree is easier to explain).

Assumptions:

XGBoost makes no strong assumptions about data distribution. It works well even if features are correlated or if there are irrelevant features (thanks to regularisation).

When it fails:

  • Very high‑dimensional sparse data (like text or image pixels) – deep learning usually works better.
  • Small datasets (less than a few hundred rows) – simple models like logistic regression or a single decision tree often outperform and are less prone to overfitting.
  • Real‑time latency‑critical applications – XGBoost prediction is fast, but an ensemble of 100 trees is slower than a linear model. For microsecond latency, consider simpler models or use specialised hardware.
  • Non‑tabular data (images, sequences, graphs) – use CNNs, RNNs, or Graph Neural Networks instead.

My opinion: XGBoost is my first choice for any supervised learning problem on structured data. I’ve seen it beat carefully tuned neural networks in multiple Kaggle competitions. The main reasons not to use it are when you desperately need interpretability (then use logistic regression or a single decision tree) or when you have a tiny dataset.

5. Implementation

Below is a complete example using XGBoost for classification on the famous breast cancer dataset. We’ll train a model, evaluate it, and show feature importance.

import xgboost as xgb
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Create XGBoost classifier
model = xgb.XGBClassifier(
    n_estimators=100,        # number of trees
    max_depth=6,             # maximum tree depth
    learning_rate=0.1,       # step size shrinkage
    subsample=0.8,           # row subsampling
    colsample_bytree=0.8,    # column subsampling per tree
    reg_lambda=1.0,          # L2 regularisation on leaf weights
    reg_alpha=0.0,           # L1 regularisation (optional)
    random_state=42,
    eval_metric='logloss'
)

# Train
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Feature importance
importance = model.feature_importances_
top_indices = np.argsort(importance)[-5:]   # top 5 features
print("\nTop 5 most important features:")
for idx in top_indices[::-1]:
    print(f"  {data.feature_names[idx]}: {importance[idx]:.3f}")


Output (your exact numbers may vary slightly):

Accuracy: 0.9737

Classification Report:
              precision    recall  f1-score   support
   malignant       0.97      0.97      0.97        42
      benign       0.98      0.98      0.98        72

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

Top 5 most important features:
  worst concave points: 0.152
  worst perimeter: 0.121
  worst texture: 0.089
  mean concave points: 0.074
  worst area: 0.068


The model achieves ~97% accuracy on the test set with almost no tuning – that’s the power of XGBoost. You can see which features drove the decision (concave points and perimeter are highly predictive for breast cancer).

6. Key Takeaways

  1. XGBoost is gradient boosting with regularisation and second‑order optimisation – it’s typically faster, more accurate, and less prone to overfitting than plain gradient boosting. Always try it as a baseline for tabular data.
  2. It handles real‑world messiness well – missing values, outliers, non‑linear relationships, and feature interactions are largely handled internally, which can save you hours of preprocessing.
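  3. Hyperparameter tuning matters – start with n_estimators=100, max_depth=6, learning_rate=0.1, then use subsample and colsample_bytree to reduce overfitting. For large datasets, GPU training can give large speedups: in XGBoost 2.0 and later, set device='cuda' together with tree_method='hist' (older releases used tree_method='gpu_hist', which is now deprecated).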