AI Metrics — A Learning Toybox
bignet · 2026-06-15 · via Hacker News - Newest: "AI"

AI Metrics, visually!

The metrics you meet when finetuning a model — grouped by when you use them, made playful and interactive.

🗺️ The map. Finetuning has three moments, each with its own metrics:
• While training → loss & perplexity (is it learning? is it overfitting?)
• Judging labels → accuracy, precision, recall, F1 (did it classify right?)
• Judging generated text → ROUGE, BLEU, BERTScore (did it write well?)
The sections below follow that order. Everything ties back to the fishing net.

The one mental model

🎣 Your model is a fishing net

Every metric here is just an anxious question someone asks about that net. Hold this picture and the rest follows.

🐟 🐟 🐟 🥾 🐟 🐟

Precision

"Everything I caught — is it actually fish, or did I also pull up boots?"

Of what I flagged positive, how much was right? Boots in the net (🥾) = false positives.

Recall

"All the fish in the lake — did I actually catch them, or did some swim through?"

Of what was really there, how much did I get? Escaped fish = false negatives.

Why they pull against each other: want perfect recall? Cover the whole lake with your net — you'll get every fish, but also every boot (so precision drops). Want perfect precision? Take only the one fish you're 100% sure of — definitely a fish, but you missed 999 others (so recall drops). F1 exists to stop both shortcuts.

① While training · the overfitting alarm

📉 Loss, validation loss & perplexity

Before judging quality, you watch whether the model is learning at all. During finetuning your dashboard shows loss dropping over time. The single most important habit for a beginner: watch the validation curve, not the training curve.

The two curves to watch across training epochs:
• Train loss: the model's fit to the training data.
• Val loss: the fit to unseen data, which is the real signal.

What is perplexity?

perplexity = e^loss. Read it as: "how many equally-likely words is the model choosing between at each step?" Perplexity 1 = perfectly certain & correct. Perplexity 50 = as unsure as picking from 50 options. Lower is better.

How to spot overfitting

Train loss usually keeps dropping, because the model can memorize the training set. When val loss turns back upward while train loss still falls, the model is memorizing, not learning. That widening gap is your signal to stop early.
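The stop-early rule can be sketched in a few lines. This is a minimal illustration, not a training framework's API: the function name `best_epoch`, the `patience` parameter, and the loss values are all made up for the example.

```python
def best_epoch(val_losses, patience=2):
    """Return the index of the epoch with the lowest validation loss,
    giving up once val loss has failed to improve `patience` times in a row."""
    best_idx = 0
    waited = 0
    for i in range(1, len(val_losses)):
        if val_losses[i] < val_losses[best_idx]:
            best_idx = i      # new best: reset the patience counter
            waited = 0
        else:
            waited += 1       # val loss did not improve this epoch
            if waited >= patience:
                break
    return best_idx


# Hypothetical curves: train loss keeps falling, val loss turns upward.
train = [2.0, 1.5, 1.1, 0.8, 0.6, 0.4]
val = [2.1, 1.7, 1.4, 1.3, 1.45, 1.6]

stop = best_epoch(val)
print(f"keep the checkpoint from epoch {stop} (val loss {val[stop]})")
```

The train curve is never consulted here, which is the point: only the validation curve decides when to stop.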

① While training · the headline number for language models

🎲 Perplexity & cross-entropy, in plain terms

🌤️ Think of a weather forecaster. Every day they predict tomorrow. A language model does the same — but predicts the next word.

Perplexity = how many options is the forecaster still guessing among?
• "100% sure it'll rain" → and it rains → only 1 option in play. Perfect. Perplexity = 1.
• "Could be sunny, rainy, cloudy, or snowy — no idea" → 4 options in play. Perplexity = 4.
• Guessing blindly → thousands of options in play. Perplexity = huge.
Fewer options = more confident and correct, so lower is better.

😱 And cross-entropy loss? It's just a "surprise meter." The model bets a probability on each word; then the true word is revealed:
• Bet 90% on the right word → "I expected that" → tiny surprise.
• Bet 50% → "fair enough" → medium surprise.
• Bet 1% → "wait, WHAT?!" → huge surprise.
Cross-entropy loss = the model's average surprise across all the words. Low surprise = good. That's the entire idea — the rest is just the math for "surprise."
Connection: perplexity = e^(loss) — loss is the raw training signal; perplexity is that same thing translated into "number of choices" you can picture. Like °C vs an obscure unit: same temperature, one you can feel.
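The surprise meter and its link to perplexity fit in a few lines of Python. This is a sketch using only the two formulas stated here (surprise = −ln(p), perplexity = e^loss); the function names are chosen for the example.

```python
import math


def surprise(p):
    """Loss for one word: -ln of the probability the model bet on the true word."""
    return -math.log(p)


def perplexity(probs_of_true_words):
    """e raised to the average surprise across all the words."""
    avg_loss = sum(surprise(p) for p in probs_of_true_words) / len(probs_of_true_words)
    return math.exp(avg_loss)


# The three bets from the list above.
for p in (0.9, 0.5, 0.01):
    print(f"bet {p:.2f} on the right word -> surprise {surprise(p):.3f} nats")

# The forecaster with four equally likely options gives each one 25%.
print("perplexity:", perplexity([0.25, 0.25, 0.25, 0.25]))
```

A model that always bets 25% on the true word ends up with perplexity 4 (up to floating-point rounding), matching the "4 options in play" picture.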

Now the same idea with real numbers. A language model's job: given the words so far, predict the next one — a probability for every word in its vocabulary. Perplexity asks:

"On average, how many words is the model unsure between at each step?"

the cat sat on the → next word is mat

The true next word is mat. Call the probability the model gave it P("mat") = p; the remaining probability is spread over other words.

Surprise (loss) = −ln(p), where p is the probability of the correct word.

Why exponentiate the loss?

Training shows cross-entropy loss = average surprise, in abstract units (nats). perplexity = e^loss converts that surprise back into a tangible count of choices. Loss 0 → ppl 1. Loss 2.3 → ppl ≈ 10. Loss 4.6 → ppl ≈ 100. Same info, friendlier units.

What's a "good" number?

It depends on vocabulary & task, so it's only meaningful relative to a baseline. A strong modern LLM on general English sits around perplexity 3–15. Random guessing across a 50k vocab ≈ 50,000. You use it to compare: did finetuning lower my perplexity on my domain's text?
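To make the baseline comparison concrete, here is a small sketch. The 50k vocabulary comes from the text; the two validation losses are invented numbers standing in for "before" and "after" finetuning.

```python
import math

VOCAB_SIZE = 50_000  # the 50k vocabulary used as the example above

# Random guessing gives every word probability 1/V, so loss = ln(V) and perplexity = V.
uniform_loss = -math.log(1 / VOCAB_SIZE)
uniform_ppl = math.exp(uniform_loss)

# Hypothetical validation losses on your domain text (illustrative only).
loss_before = 2.9
loss_after = 2.3

ppl_before = math.exp(loss_before)
ppl_after = math.exp(loss_after)

print(f"random guessing: loss {uniform_loss:.2f}, perplexity {uniform_ppl:.0f}")
print(f"before finetuning: perplexity {ppl_before:.1f}")
print(f"after finetuning:  perplexity {ppl_after:.1f}")
```

The absolute values matter less than the direction: the same text, scored by the same tokenizer, should get a lower perplexity after finetuning.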

One important limitation: perplexity only measures next-word prediction on text you already have. It says nothing about whether answers are helpful, true, or well-formatted — that's what the §③ generation metrics and LLM-as-judge are for.

② Judging labels · build it by hand

The confusion matrix

Take 12 items. The emoji is the truth: 🐟 is actually positive, 🥾 is actually negative. The model predicts positive for some of them ("caught in net"), and each item lands in one of the four cells below. The goal is to catch all the fish without grabbing boots.

Actually 🐟, predicted positive: True Positive (caught fish)
Actually 🐟, predicted negative: False Negative (fish escaped)
Actually 🥾, predicted positive: False Positive (caught a boot)
Actually 🥾, predicted negative: True Negative (left boot alone)
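Counting the four cells by hand is a short loop. The sketch below assumes 12 items split into 7 fish and 5 boots, with an arbitrary set of predictions; both lists are hypothetical.

```python
def confusion_counts(truth, predicted):
    """Count the four confusion-matrix cells from two parallel lists of booleans.

    truth[i] is True for a fish (actually positive);
    predicted[i] is True when the model caught the item in the net.
    """
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for is_fish, caught in zip(truth, predicted):
        if is_fish and caught:
            counts["TP"] += 1   # caught fish
        elif is_fish and not caught:
            counts["FN"] += 1   # fish escaped
        elif caught:
            counts["FP"] += 1   # caught a boot
        else:
            counts["TN"] += 1   # left boot alone
    return counts


# Hypothetical lake: 7 fish followed by 5 boots.
truth = [True] * 7 + [False] * 5
predicted = [True, True, True, True, True, False, False,   # 5 fish caught, 2 escaped
             True, False, False, False, False]             # 1 boot caught, 4 left alone

cells = confusion_counts(truth, predicted)
print(cells)
```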

② Judging labels · the trade-off

Precision vs Recall: the slider

Real models output a score (0–1), and you pick a threshold: score ≥ threshold → predict positive. With the scores held fixed, moving the threshold pulls precision and recall in opposite directions.

Low threshold = cast a wide net (high recall, low precision). High threshold = only the sure things (high precision, low recall).
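The trade-off is easy to reproduce with fixed scores and a moving threshold. The eight scored items below are invented for illustration, and reporting precision as 0.0 when nothing is predicted positive is just one common convention.

```python
def precision_recall_at(threshold, scored_items):
    """Precision and recall when every item with score >= threshold is predicted positive."""
    tp = fp = fn = 0
    for score, is_positive in scored_items:
        predicted_positive = score >= threshold
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif is_positive:
            fn += 1
    # Precision is undefined with no positive predictions; 0.0 is used here by convention.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical (score, actually positive?) pairs.
items = [(0.95, True), (0.90, True), (0.80, False), (0.70, True),
         (0.55, True), (0.40, False), (0.30, True), (0.10, False)]

for t in (0.2, 0.5, 0.85):
    p, r = precision_recall_at(t, items)
    print(f"threshold {t:.2f}: precision {p:.2f}, recall {r:.2f}")
```

On this data the low threshold reaches full recall with some boots in the net, and the high threshold reaches full precision while missing most of the fish.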

② Judging labels · the classic beginner trap

⚠️ Why "accuracy" lies

Accuracy = "what fraction did I get right?" It is the most intuitive metric, and the most dangerous on imbalanced data. Take a fraud detector where only a few transactions are actually fraud.

Now meet the "lazy model" that just predicts "never fraud" for everything, doing zero real work. Its accuracy equals the share of legitimate transactions, so at a 2% fraud rate it scores 98%. Its recall on fraud is 0: it caught 0 of the fraud.

This is the whole reason precision, recall & F1 exist: they ignore the easy majority class and ask "did you catch the thing that matters?" A model can have 98% accuracy and be useless.
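A quick sketch of a model that never flags fraud shows the gap between accuracy and recall. The transaction count and fraud rates are hypothetical.

```python
def lazy_model_report(n_transactions, fraud_rate):
    """Accuracy and fraud recall for a model that predicts "never fraud" every time."""
    n_fraud = round(n_transactions * fraud_rate)
    n_legit = n_transactions - n_fraud
    # Every legitimate transaction is a true negative; every fraud is a false negative.
    tp, fp, fn, tn = 0, 0, n_fraud, n_legit
    accuracy = (tp + tn) / n_transactions
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


for rate in (0.20, 0.05, 0.02):
    acc, rec = lazy_model_report(1000, rate)
    print(f"fraud rate {rate:.0%}: accuracy {acc:.0%}, recall on fraud {rec:.0%}")
```

The rarer the fraud, the better the accuracy looks, while recall stays at zero throughout.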

② Reference

The formulas, with their job

Precision

TP / (TP + FP)

"Of my catch, how much is fish?"

Use when false alarms are costly. Spam filter — blocking real email is worse than missing some spam.

Recall

TP / (TP + FN)

"Of all the fish, how many did I catch?"

Use when misses are costly. Cancer screening — a false alarm beats missing a sick patient.

F1

2·P·R / (P + R)

Harmonic mean — dies if either is low.

Use when you need balance and don't want a high score from maximizing just one side.

Why harmonic mean, not regular average? Regular average of P=1.0, R=0.0 is 0.5 (looks okay!). Harmonic mean is 0.0 — it refuses to reward a model that ignores half the problem.
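The three formulas translate directly into code. The counts below are hypothetical, and returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
def precision(tp, fp):
    """TP / (TP + FP): of my catch, how much is fish?"""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN): of all the fish, how many did I catch?"""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1(p, r):
    """Harmonic mean 2PR / (P + R); zero as soon as either side is zero."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 5 fish caught, 1 boot caught, 2 fish escaped.
tp, fp, fn = 5, 1, 2
p, r = precision(tp, fp), recall(tp, fn)
print(f"precision {p:.3f}, recall {r:.3f}, F1 {f1(p, r):.3f}")

# The degenerate case from the text: perfect precision, zero recall.
print("arithmetic mean:", (1.0 + 0.0) / 2)
print("harmonic mean (F1):", f1(1.0, 0.0))
```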

③ Judging generated text · metrics for words

ROUGE = the same idea, for words

When a model generates text (summaries, translations), the "fish in the lake" become the words in the reference. ROUGE asks: how much of the reference did the generated text catch?

The inputs:
• Reference (the ideal / "truth")
• Generated (your model's output)

The scores:
• ROUGE-1 (words)
• ROUGE-2 (pairs)
• ROUGE-L (sequence)

Why ROUGE-2 / ROUGE-L matter: try Reference dog bites man and Generated man bites dog. ROUGE-1 says perfect (same 3 words!) — but the meaning is reversed. ROUGE-2 (word pairs) and ROUGE-L (order) catch what ROUGE-1 misses.
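The dog/man example can be checked with a simplified, recall-only version of the three scores. This is a sketch under stated assumptions: whitespace tokenization, no stemming, and recall only, whereas ROUGE toolkits typically also report precision and an F-measure.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-word chunks of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(reference, generated, n):
    """Share of the reference's n-grams that also appear in the generated text."""
    ref = Counter(ngrams(reference.split(), n))
    gen = Counter(ngrams(generated.split(), n))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, gen[gram]) for gram, count in ref.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence (order kept, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_recall(reference, generated):
    """Longest common subsequence length divided by the reference length."""
    ref, gen = reference.split(), generated.split()
    return lcs_length(ref, gen) / len(ref) if ref else 0.0


reference = "dog bites man"
generated = "man bites dog"
print("ROUGE-1:", rouge_n_recall(reference, generated, 1))
print("ROUGE-2:", rouge_n_recall(reference, generated, 2))
print("ROUGE-L:", rouge_l_recall(reference, generated))
```

ROUGE-1 comes out perfect, ROUGE-2 finds no shared pair, and ROUGE-L credits only one word in order.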

③ Judging generated text · the family

ROUGE vs BLEU — two sides of the same idea

ROUGE and BLEU both count word overlap; they just lead with different anxieties from the fishing net:

ROUGE → recall-leaning

"Did I cover everything in the reference?"

Standard for summarization — a summary that drops key points is the failure you fear. Misses hurt.

BLEU → precision-leaning

"Is everything I generated actually correct?"

Standard for translation — inventing words that don't belong is the failure you fear. It adds a "brevity penalty" so you can't get a high score by outputting just one perfect word.

Same precision/recall trade-off you saw with the fishing net — just applied to word-chunks, with each field picking the side that matches its worst failure.
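To see why BLEU needs its brevity penalty, here is a deliberately simplified sketch: clipped unigram precision multiplied by the penalty. Full BLEU combines 1- to 4-gram precisions and is usually computed over a whole corpus, so treat this as an illustration of the idea only. The example sentences are made up.

```python
import math
from collections import Counter


def clipped_unigram_precision(reference, generated):
    """Share of generated words found in the reference, each credited at most
    as many times as it occurs in the reference."""
    ref_counts = Counter(reference.split())
    gen_tokens = generated.split()
    if not gen_tokens:
        return 0.0
    gen_counts = Counter(gen_tokens)
    matched = sum(min(count, ref_counts[word]) for word, count in gen_counts.items())
    return matched / len(gen_tokens)


def brevity_penalty(reference, generated):
    """1.0 when the output is at least as long as the reference, else exp(1 - r/c)."""
    r = len(reference.split())
    c = len(generated.split())
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)


reference = "the cat sat on the mat"
one_word = "the"

prec = clipped_unigram_precision(reference, one_word)
bp = brevity_penalty(reference, one_word)
print(f"precision {prec:.2f}, brevity penalty {bp:.4f}, score {prec * bp:.4f}")
```

The one-word output has perfect precision, yet the penalty pushes its score close to zero.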

③ The limitation & the modern fix

The blind spot: none of these understand meaning

ROUGE and BLEU count surface overlap. To them, "the film was great" and "the movie was excellent" share only the filler words "the" and "was": none of the content words match and no word pair matches, so the score is low despite identical meaning. For finetuning a chatbot, that makes them weak judges.

BERTScore

Compares embeddings, not exact words. "film/movie", "great/excellent" score as near-matches. The meaning-aware fix for ROUGE/BLEU's blindness.
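The matching idea behind BERTScore can be sketched with toy vectors: each word is paired with its most similar word on the other side by cosine similarity, and the similarities are averaged. The 2-D vectors below are invented for illustration. Real BERTScore uses contextual embeddings from a BERT-style model and optional importance weighting, neither of which is modelled here.

```python
import math

# Toy 2-D "embeddings", made up so that synonyms point in nearly the same direction.
TOY_VECTORS = {
    "the": (1.0, 0.0),
    "was": (0.0, 1.0),
    "film": (0.7, 0.7),
    "movie": (0.72, 0.68),
    "great": (-0.6, 0.8),
    "excellent": (-0.62, 0.78),
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def greedy_match(source_tokens, target_tokens):
    """Average, over source tokens, of the best similarity to any target token."""
    best = [
        max(cosine(TOY_VECTORS[s], TOY_VECTORS[t]) for t in target_tokens)
        for s in source_tokens
    ]
    return sum(best) / len(best)


reference = "the film was great".split()
generated = "the movie was excellent".split()

recall = greedy_match(reference, generated)      # how well the reference is covered
precision = greedy_match(generated, reference)   # how well the output is supported
f1 = 2 * precision * recall / (precision + recall)
exact_overlap = len(set(reference) & set(generated)) / len(reference)

print(f"exact word overlap: {exact_overlap:.2f}")
print(f"embedding-based F1: {f1:.3f}")
```

Exact matching credits only half the words, while the embedding-based score treats "film/movie" and "great/excellent" as near-matches.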

LLM-as-a-judge

Ask a strong model (e.g. Claude) to score your finetuned model's answers for helpfulness, correctness, tone. A widely used method today for instruction-tuned models.

Win rate / human eval

"Is answer A better than B?" across many prompts. Pairwise preference is what RLHF and Chatbot-Arena rankings use. Humans remain the gold standard for subjective quality.

The finetuning takeaway: n-gram metrics (ROUGE/BLEU) are cheap and automatic, which makes them great for a fast signal and for tasks with one right answer. For open-ended chat quality, they undercount good paraphrases, so the field leans on LLM-as-judge and human preference. Match the metric to what failure you actually fear.