





















In this blog post, we will look at the difference between throughput and goodput, why throughput alone can give you a dangerously false sense of confidence, and how goodput, the metric championed by NVIDIA’s AIPerf tool, tells you the truth about your LLM deployment.
If you have ever shipped a feature that looked perfectly healthy in your monitoring dashboard but fell apart under real user load, this post is for you.
Throughput is one of the oldest and most familiar metrics in performance testing. Simply put, it answers the question: how much work can the system do in a given time window?
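Depending on the context, throughput is expressed in units such as requests per second (req/s), transactions per second (TPS), megabytes per second (MB/s), or tokens per second (tokens/s).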
In a JMeter test report, the throughput number is front and center. In a k6 summary, it shows up as http_reqs. In a Grafana dashboard, it is usually one of the first panels you look at.
Throughput tells you volume. It does not tell you quality.
Here is a scenario that should feel familiar.
You run a load test. Throughput looks great: 100 req/s. No errors. You ship. Real users start complaining that the app feels sluggish or unresponsive. You go back to your dashboard. Throughput is still 100 req/s. Green across the board.
The system was technically completing requests. But a large portion of those requests were taking 4 to 5 seconds to respond instead of the 500ms your users expect. The requests were counted as successful because they returned HTTP 200. Throughput does not care about latency. It just counts completions.
This is the gap. And in traditional web performance testing, experienced engineers close that gap by adding percentile latency checks (p95, p99) as assertions. But in LLM performance testing, the problem is deeper.
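For the traditional web case, a percentile check of that kind can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation: the latency values are synthetic, and the `percentile` helper and the `P95_BUDGET_MS` budget are hypothetical names chosen for this example.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Synthetic latencies (ms) for 100 requests that all returned HTTP 200:
# 60 fast responses and 40 slow ones. The numbers are made up for illustration.
latencies_ms = [400] * 60 + [4500] * 40
duration_s = 1.0

# Throughput only counts completions, so the slow responses are invisible here.
throughput = len(latencies_ms) / duration_s
p95 = percentile(latencies_ms, 95)

P95_BUDGET_MS = 500  # hypothetical latency budget
slo_ok = p95 <= P95_BUDGET_MS

print(f"throughput: {throughput:.0f} req/s")
print(f"p95 latency: {p95} ms")
print("p95 within budget:", slo_ok)
```

With this data the throughput line still reads 100 req/s, while the p95 check fails because the slowest 40% of responses sit far above the budget.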
Imagine a busy dosa stall in Coimbatore during the morning rush.
The stall owner proudly says: “We served 100 dosas this hour.” That is throughput. 100 dosas per hour.
But here is the real picture:
Only 52 dosas were served hot, crispy, and within the 5-minute promise. That is goodput. 52 dosas per hour.
The stall is technically operating at 100 dosas/hour. But only 52 of them actually met the quality standard the customer was promised.
Now imagine this stall is your LLM API, and each dosa is an inference request. The “hot and crispy within 5 minutes” rule is your SLO.
Goodput is the number of requests per second that completed and met all your defined SLO constraints.
This definition comes directly from NVIDIA’s AIPerf tool (the successor to GenAI-Perf), which is the industry standard for LLM inference benchmarking. In AIPerf, you define goodput constraints when you run a benchmark:
```bash
aiperf profile \
  --model "llama-3.1-70b" \
  --url http://inference-server:8000 \
  --goodput-ttft 500 \
  --goodput-itl 100
```
This tells the tool to count a request toward goodput only if both of the following hold: its TTFT is at most 500 ms and its ITL is at most 100 ms.
A request that completes but violates either constraint does not count. It is a failed request from the user’s perspective, even if the HTTP status code was 200.
LLM inference has two latency metrics that users feel directly:
Time to First Token (TTFT) is how long the user waits before they see the first word of the response. This is what makes an LLM feel fast or laggy. A high TTFT means users are staring at a blank screen or a loading spinner.
Inter-Token Latency (ITL) is the delay between each token in the streamed response. A high ITL makes the text appear to stutter or pause mid-sentence, which breaks the feeling of a natural conversation.
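To make the two definitions concrete, here is a minimal sketch that derives both values from the arrival times of streamed tokens. The function name `ttft_and_itl_ms` and the timestamps are invented for this example, and averaging the gaps into one ITL value per request is an assumption; a benchmarking tool may aggregate differently.

```python
from statistics import mean

def ttft_and_itl_ms(request_sent_ms, token_times_ms):
    """Return (TTFT, mean ITL) in milliseconds for one streamed response.

    request_sent_ms: time the request was sent.
    token_times_ms: arrival time of each streamed token, in order.
    """
    # TTFT: wait between sending the request and seeing the first token.
    ttft = token_times_ms[0] - request_sent_ms
    # ITL: gaps between consecutive tokens, averaged over the response.
    gaps = [later - earlier
            for earlier, later in zip(token_times_ms, token_times_ms[1:])]
    itl = mean(gaps) if gaps else 0.0
    return ttft, itl

# Made-up timestamps: request sent at t=0, five tokens streamed back.
ttft, itl = ttft_and_itl_ms(0, [420, 460, 500, 650, 690])
print(f"TTFT: {ttft} ms, mean ITL: {itl} ms")
```

Note that one 150 ms pause in the middle of this stream is smoothed out by the average, which is one reason to look at the distribution of gaps as well as the mean.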
Both of these metrics degrade under load. As concurrency increases, the inference server queue backs up. TTFT climbs first, because requests sit waiting to be processed. ITL can follow if GPU compute is saturated.
Throughput stays stable through all of this. The server is still completing requests. It is just that the user experience is becoming progressively worse.
Goodput captures that degradation directly. When TTFT crosses your SLO threshold, those requests stop contributing to goodput. The goodput number drops visibly, even while throughput holds steady.
As I showed in an earlier post, “99% of Requests Failed and My Dashboard Showed Green”, you can have a request throughput of 0.91 req/s that looks reasonable while goodput sits at 0.01 req/s, meaning roughly 99% of requests were silently breaching the SLO.
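The arithmetic behind that figure is short. Because throughput and goodput share the same measurement window, their ratio is the fraction of requests that met the SLO. The variable names below are only for illustration.

```python
throughput_rps = 0.91  # request throughput from the example above
goodput_rps = 0.01     # goodput from the same run

# Same time window in both numbers, so the ratio is the share of compliant requests.
compliance = goodput_rps / throughput_rps
breach = 1 - compliance

print(f"SLO compliance: {compliance:.1%}")
print(f"SLO breaches:   {breach:.1%}")
```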
Goodput is straightforward once you have your SLO thresholds defined:
Goodput (req/s) = Requests that met ALL SLO constraints / Total measurement time (seconds)
For an LLM workload with TTFT and ITL SLOs:
A request counts toward goodput if:
TTFT <= ttft_slo_ms AND ITL <= itl_slo_ms
Notice that it uses AND, not OR. Both conditions must be satisfied. A request with excellent ITL but a TTFT of 3 seconds still fails. The user waited 3 seconds before seeing anything, which is a broken experience regardless of how smooth the streaming was after that.
Here is a simplified pseudocode showing how goodput is computed behind the scenes:
```
// Configuration
TTFT_SLO = 500   // milliseconds
ITL_SLO  = 100   // milliseconds

// Tracking
total_requests = 0
compliant_requests = 0
measurement_start = current_time()

// Run benchmark loop
for each request sent:
    result = send_llm_request(prompt)
    total_requests++

    ttft = result.time_to_first_token_ms
    itl  = result.inter_token_latency_ms

    if ttft <= TTFT_SLO AND itl <= ITL_SLO:
        compliant_requests++

// Calculate metrics
measurement_duration_seconds = current_time() - measurement_start
throughput = total_requests / measurement_duration_seconds
goodput    = compliant_requests / measurement_duration_seconds

print("Request Throughput (req/s): " + throughput)
print("Goodput (req/s): " + goodput)
print("SLO Compliance Rate (%): " + (compliant_requests / total_requests * 100))
```
When your system is healthy and under low load, throughput and goodput will be very close. As concurrency increases and the system starts to struggle, you will see goodput diverge downward from throughput. That divergence is your early warning signal.
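A runnable version of the same idea, extended to spot that divergence across a concurrency sweep, might look like the sketch below. Everything here is an assumption made for illustration: the `RequestResult` type, the `summarize` and `first_divergence` helpers, the 0.9 ratio threshold, and the synthetic per-request numbers. Real measurements would come from your benchmarking tool.

```python
from dataclasses import dataclass

TTFT_SLO_MS = 500
ITL_SLO_MS = 100

@dataclass
class RequestResult:
    ttft_ms: float
    itl_ms: float

def summarize(results, duration_s):
    """Return (throughput, goodput) in req/s for one benchmark run."""
    compliant = sum(
        1 for r in results
        if r.ttft_ms <= TTFT_SLO_MS and r.itl_ms <= ITL_SLO_MS
    )
    return len(results) / duration_s, compliant / duration_s

def first_divergence(runs, min_ratio=0.9):
    """Return the lowest concurrency where goodput/throughput falls below min_ratio."""
    for concurrency in sorted(runs):
        throughput, goodput = summarize(*runs[concurrency])
        if throughput > 0 and goodput / throughput < min_ratio:
            return concurrency
    return None

def make_run(n_ok, n_slow):
    """Build a synthetic 10-second run: n_ok compliant and n_slow high-TTFT requests."""
    results = [RequestResult(300, 40)] * n_ok + [RequestResult(2500, 40)] * n_slow
    return results, 10.0

# Synthetic sweep keyed by concurrency level; every run completes 100 requests.
runs = {
    1: make_run(100, 0),
    8: make_run(95, 5),
    32: make_run(52, 48),
}

for concurrency in sorted(runs):
    throughput, goodput = summarize(*runs[concurrency])
    print(f"concurrency={concurrency:>2} throughput={throughput:.1f} goodput={goodput:.1f} req/s")

print("goodput diverges at concurrency:", first_divergence(runs))
```

In this made-up sweep, throughput is 10 req/s at every level, so only the goodput column shows that something changed at the highest concurrency.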
| Dimension | Throughput | Goodput |
|---|---|---|
| What it measures | All completed requests per second | Completed requests per second that met SLO |
| SLO-aware | No | Yes |
| Fails silently on latency degradation | Yes | No |
| Typical units | req/s, TPS, MB/s, tokens/s | req/s |
| Tool example | JMeter, k6, wrk | NVIDIA AIPerf |
| Use case | Capacity planning, raw volume | User experience validation, production readiness |
| Can look good while users suffer | Yes | No |
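Use throughput when you are doing capacity planning or need to know the raw volume the system can handle.
Use goodput when you are validating user experience or judging production readiness.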
A healthy practice is to report both numbers together. If goodput and throughput are close, your system is healthy. If they diverge significantly, you have a quality problem that raw throughput is hiding.
Throughput answers: can the system handle the volume?
Goodput answers: is the system actually serving users well at that volume?
In traditional performance testing, latency SLOs were enforced through assertions and percentile checks. In LLM performance testing, goodput formalizes this into a single metric that is directly comparable to throughput. NVIDIA’s AIPerf makes this measurable out of the box with the --goodput-ttft and --goodput-itl flags.
Next time you look at a load test result, ask yourself: do I know the goodput number? If the answer is no, you only have half the picture.
Happy Testing!