Throughput vs Goodput: The Performance Metric You Are Probably Ignoring in LLM Testing - QAInsights
qainsights · 2026-05-22 · via Hacker News - Newest: "LLM"

In this blog post, we will see the difference between throughput and goodput, why throughput alone can give you a dangerously false sense of confidence, and how goodput, the metric championed by NVIDIA’s AIPerf tool, tells you the truth about your LLM deployment.

If you have ever shipped a feature that looked perfectly healthy in your monitoring dashboard but fell apart under real user load, this post is for you.

What is Throughput?

Throughput is one of the oldest and most familiar metrics in performance testing. Simply put, it answers the question: how much work can the system do in a given time window?

Depending on the context, throughput is expressed as:

  • Requests per second (req/s): most common in API and web performance testing
  • Transactions per second (TPS): common in database and payment system testing
  • Megabytes per second (MB/s): common in file transfer and network testing
  • Tokens per second: specific to LLM inference workloads

In a JMeter test report, the throughput number is front and center. In a k6 summary, it shows up as http_reqs. In a Grafana dashboard, it is usually one of the first panels you look at.

Throughput tells you volume. It does not tell you quality.


The Problem with Throughput Alone

Here is a scenario that should feel familiar.

You run a load test. Throughput looks great: 100 req/s. No errors. You ship. Real users start complaining that the app feels sluggish or unresponsive. You go back to your dashboard. Throughput is still 100 req/s. Green across the board.

What happened?

The system was technically completing requests. But a large portion of those requests were taking 4 to 5 seconds to respond instead of the 500ms your users expect. The requests were counted as successful because they returned HTTP 200. Throughput does not care about latency. It just counts completions.

This is the gap. And in traditional web performance testing, experienced engineers close that gap by adding percentile latency checks (p95, p99) as assertions. But in LLM performance testing, the problem is deeper.
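
To make the gap concrete, here is a minimal sketch of a percentile assertion. The latency values, the 1-second window, and the `percentile` helper are all made up for illustration; only the 500ms target comes from the scenario above.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds: 60 fast responses and 40 slow ones, all HTTP 200.
latencies_ms = [300] * 60 + [4500] * 40
duration_s = 1.0  # assumed measurement window

throughput = len(latencies_ms) / duration_s
p95 = percentile(latencies_ms, 95)
p95_ok = p95 <= 500  # 500 ms target from the scenario above

print(f"throughput={throughput:.0f} req/s, p95={p95} ms, p95 within target: {p95_ok}")
```

Throughput reports a healthy 100 req/s, while the p95 check fails because 40% of the responses took 4.5 seconds.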


The Dosa Stall Analogy

Imagine a busy dosa stall in Coimbatore during the morning rush.

The stall owner proudly says: “We served 100 dosas this hour.” That is throughput. 100 dosas per hour.

But here is the real picture:

  • 28 dosas were served cold because the tawa was overcrowded
  • 15 dosas arrived 20 minutes after the order because the batter queue was too long
  • 5 dosas were undercooked

Only 52 dosas were served hot, crispy, and within the 5-minute promise. That is goodput. 52 dosas per hour.

The stall is technically operating at 100 dosas/hour. But only 52 of them actually met the quality standard the customer was promised.

Now imagine this stall is your LLM API, and each dosa is an inference request. The “hot and crispy within 5 minutes” rule is your SLO.


What is Goodput?

Goodput is the number of requests per second that completed and met all your defined SLO constraints.

This definition comes directly from NVIDIA’s AIPerf tool (the successor to GenAI-Perf), a widely used tool for LLM inference benchmarking. In AIPerf, you define goodput constraints when you run a benchmark:

aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --goodput-ttft 500 \
  --goodput-itl 100

This tells the tool: only count a request toward goodput if:

  • Time to First Token (TTFT) was under 500ms, AND
  • Inter-Token Latency (ITL) was under 100ms

A request that completes but violates either constraint does not count. It is a failed request from the user’s perspective, even if the HTTP status code was 200.


LLM inference has two latency metrics that users feel directly:

Time to First Token (TTFT) is how long the user waits before they see the first word of the response. This is what makes an LLM feel fast or laggy. A high TTFT means users are staring at a blank screen or a loading spinner.

Inter-Token Latency (ITL) is the delay between each token in the streamed response. A high ITL makes the text appear to stutter or pause mid-sentence, which breaks the feeling of a natural conversation.
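
As a rough illustration, both metrics can be derived from the arrival times of the streamed tokens. The sketch below assumes ITL is reported as the mean gap between consecutive tokens of one response; the function name and timestamps are hypothetical, and a real tool may aggregate differently.

```python
def ttft_and_itl_ms(request_sent_ms, token_arrival_ms):
    """Return (TTFT, mean ITL) in milliseconds for one streamed response."""
    if not token_arrival_ms:
        raise ValueError("response contained no tokens")
    # TTFT: wait between sending the request and seeing the first token.
    ttft = token_arrival_ms[0] - request_sent_ms
    # ITL: average gap between consecutive tokens (0.0 for a single-token response).
    gaps = [b - a for a, b in zip(token_arrival_ms, token_arrival_ms[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

# Hypothetical timestamps (ms): request sent at t=0, five tokens streamed back.
ttft, itl = ttft_and_itl_ms(0, [420, 470, 530, 580, 660])
print(f"TTFT={ttft} ms, mean ITL={itl} ms")
```

Here the user waits 420ms for the first token, and the remaining tokens arrive about 60ms apart on average.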

Both of these metrics degrade under load. As concurrency increases, the inference server queue backs up. TTFT climbs first, because requests sit waiting to be processed. ITL can follow if GPU compute is saturated.

Throughput stays stable through all of this. The server is still completing requests. It is just that the user experience is becoming progressively worse.

Goodput captures that degradation directly. When TTFT crosses your SLO threshold, those requests stop contributing to goodput. The goodput number drops visibly, even while throughput holds steady.

As I showed in an earlier post, "99% of Requests Failed and My Dashboard Showed Green", you can have a request throughput of 0.91 req/s that looks reasonable while goodput sits at 0.01 req/s, meaning roughly 99% of requests were silently breaching the SLO.
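
The arithmetic behind that percentage is a simple ratio. This is only a sketch using the two rounded figures quoted above, so the result is approximate:

```python
throughput_rps = 0.91  # request throughput quoted above
goodput_rps = 0.01     # goodput quoted above

# Both rates cover the same time window, so the durations cancel out
# and the ratio is the share of requests that met the SLO.
compliance = goodput_rps / throughput_rps
print(f"SLO-compliant: {compliance:.1%}, breaching: {1 - compliance:.1%}")
```

With these inputs, about 1.1% of requests met the SLO and about 98.9% breached it.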


The Formula

Goodput is straightforward once you have your SLO thresholds defined:

Goodput (req/s) = Requests that met ALL SLO constraints / Total measurement time (seconds)

For an LLM workload with TTFT and ITL SLOs:

A request counts toward goodput if:
  TTFT < ttft_slo_ms  AND  ITL < itl_slo_ms

Notice that it uses AND, not OR. Both conditions must be satisfied. A request with excellent ITL but a TTFT of 3 seconds still fails. The user waited 3 seconds before seeing anything. That is a broken experience regardless of how smooth the streaming was after that.


Pseudocode: Calculating Goodput

Here is simplified pseudocode showing how goodput is computed behind the scenes:

// Configuration
TTFT_SLO = 500    // milliseconds
ITL_SLO  = 100    // milliseconds

// Tracking
total_requests      = 0
compliant_requests  = 0
measurement_start   = current_time()

// Run benchmark loop
for each request sent:
    result = send_llm_request(prompt)

    total_requests++

    ttft = result.time_to_first_token_ms
    itl  = result.inter_token_latency_ms

    if ttft < TTFT_SLO AND itl < ITL_SLO:
        compliant_requests++

// Calculate metrics
measurement_duration_seconds = current_time() - measurement_start

throughput = total_requests / measurement_duration_seconds
goodput    = compliant_requests / measurement_duration_seconds

print("Request Throughput (req/s): " + throughput)
print("Goodput            (req/s): " + goodput)
print("SLO Compliance Rate (%):    " + (compliant_requests / total_requests * 100))

When your system is healthy and under low load, throughput and goodput will be very close. As concurrency increases and the system starts to struggle, you will see goodput diverge downward from throughput. That divergence is your early warning signal.
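
For readers who want something runnable, here is a minimal Python sketch of the same calculation over already-recorded results. The `RequestResult` type, the `summarize` function, and the sample numbers are hypothetical and are not AIPerf's implementation. It uses the strict comparison from the rule in The Formula section.

```python
from dataclasses import dataclass

@dataclass
class RequestResult:
    ttft_ms: float  # time to first token
    itl_ms: float   # inter-token latency for the request

def summarize(results, duration_s, ttft_slo_ms=500, itl_slo_ms=100):
    """Compute throughput, goodput, and SLO compliance for one measurement window."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    compliant = sum(
        1 for r in results
        if r.ttft_ms < ttft_slo_ms and r.itl_ms < itl_slo_ms  # both must hold
    )
    total = len(results)
    return {
        "throughput_rps": total / duration_s,
        "goodput_rps": compliant / duration_s,
        "compliance_pct": 100 * compliant / total if total else 0.0,
    }

# Hypothetical results recorded over a 2-second window.
results = [
    RequestResult(ttft_ms=320, itl_ms=40),    # meets both SLOs
    RequestResult(ttft_ms=3000, itl_ms=35),   # smooth streaming, but TTFT too slow
    RequestResult(ttft_ms=410, itl_ms=180),   # fast start, but stuttering stream
    RequestResult(ttft_ms=450, itl_ms=90),    # meets both SLOs
]
summary = summarize(results, duration_s=2.0)
print(summary)
```

All four requests count toward throughput (2.0 req/s), but only two count toward goodput (1.0 req/s), giving 50% SLO compliance.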


Throughput vs Goodput: Side-by-Side

| Dimension | Throughput | Goodput |
| --- | --- | --- |
| What it measures | All completed requests per second | Completed requests per second that met SLO |
| SLO-aware | No | Yes |
| Fails silently on latency degradation | Yes | No |
| Typical units | req/s, TPS, MB/s, tokens/s | req/s |
| Tool example | JMeter, k6, wrk | NVIDIA AIPerf |
| Use case | Capacity planning, raw volume | User experience validation, production readiness |
| Can look good while users suffer | Yes | No |

When Should You Use Each Metric?

Use throughput when:

  • You are doing capacity planning and need to understand raw system limits
  • You are comparing infrastructure configurations (e.g. 2 GPU vs 4 GPU) at the same load level
  • You are generating a baseline before adding SLO constraints

Use goodput when:

  • You are validating production readiness of an LLM endpoint
  • You want to know whether users are actually being served well, not just served
  • You are running a concurrency sweep to find the point where your SLO breaks
  • You are integrating LLM performance checks into your CI/CD pipeline

A healthy practice is to report both numbers together. If goodput and throughput are close, your system is healthy. If they diverge significantly, you have a quality problem that raw throughput is hiding.
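
One way to act on that comparison is to scan a concurrency sweep for the first level where goodput falls away from throughput. The sketch below is illustrative only: the sweep numbers are invented, and the 0.9 ratio is an arbitrary threshold you would tune for your own SLO.

```python
def first_divergence(sweep, min_ratio=0.9):
    """Return the lowest concurrency whose goodput/throughput ratio drops below min_ratio."""
    for concurrency, throughput_rps, goodput_rps in sorted(sweep):
        if throughput_rps > 0 and goodput_rps / throughput_rps < min_ratio:
            return concurrency
    return None  # goodput tracked throughput across the whole sweep

# Hypothetical sweep: (concurrency, throughput req/s, goodput req/s).
sweep = [
    (1, 2.0, 2.0),
    (4, 7.8, 7.6),
    (8, 14.9, 13.8),
    (16, 15.2, 6.1),
    (32, 15.3, 0.4),
]
print(first_divergence(sweep))
```

In this made-up sweep, throughput keeps rising or holding steady, but goodput collapses at a concurrency of 16. A CI job could fail the build when the returned level is below the concurrency you expect in production.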


Key Takeaway

Throughput answers: can the system handle the volume?

Goodput answers: is the system actually serving users well at that volume?

In traditional performance testing, latency SLOs were enforced through assertions and percentile checks. In LLM performance testing, goodput formalizes this into a single metric that is directly comparable to throughput. NVIDIA’s AIPerf makes this measurable out of the box with the --goodput-ttft and --goodput-itl flags.

Next time you look at a load test result, ask yourself: do I know the goodput number? If the answer is no, you only have half the picture.

Happy Testing!