AI Metrics Decoded: From Parameters to TOPS
Sreeraj Sree · 2026-05-26 · via DEV Community

AI Metrics Decoded: The Numbers That Actually Matter in Production


Why You Need to Know This (Before Your First Production Incident)

Picture this: your team picks a 70B parameter model for a new feature. It runs great on your MacBook. You push to production. The GPU bill arrives. Your manager is not happy.

Or this: your AI API costs explode halfway through the month and nobody knows why.

These are not horror stories. They happen to real engineers — usually the ones who skipped learning the core units of measurement behind AI systems.

As a junior engineer, you're going to face questions like:

  • "Can our GPU handle this model?"
  • "Why is the response so slow?"
  • "How many tokens are we burning per user per day?"
  • "Should we use a 7B or 70B model for this use case?"

Understanding the eight core metrics below gives you the language, and the instincts, to answer confidently.

Let's break them down.


🧠 Category 1: Model Size — Parameters & Tokens

Parameters

What it is: The learned weights inside a neural network. Think of them as the "memory" of the model — numbers that get adjusted during training to capture patterns in data.

The unit: Just a raw count. We usually express it in:

  • M = millions (e.g., BERT = 110M)
  • B = billions (e.g., LLaMA 3 8B)
  • T = trillions (e.g., GPT-4, unofficially estimated at ~1.8T)

Why it matters to you:

Parameter Count    Approx. VRAM Needed (fp16)    Typical Use Case
──────────────────────────────────────────────────────────────────────
1B–3B              ~2–6 GB                       Mobile / edge apps
7B–8B              ~16 GB                        Single consumer GPU
13B–14B            ~28 GB                        Single pro GPU (A100 40GB)
70B                ~140 GB                       Multi-GPU setup
405B+              ~800 GB+                      Cluster of H100s

Rule of thumb: 1 billion parameters ≈ 2 GB of VRAM in half-precision (fp16). Double it for full precision (fp32).
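The rule of thumb is just bytes-per-parameter arithmetic, so it is easy to turn into a helper. A minimal sketch, assuming weights only (KV cache, activations and runtime overhead come on top) and decimal gigabytes; the function name and the precision table are illustrative, not taken from any library:

```python
# Bytes needed to store one parameter at each precision (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_weight_vram_gb(params_billions: float, precision: str = "fp16") -> float:
    """Rough VRAM for the model weights, in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations and framework overhead, so treat the
    result as a lower bound rather than a sizing guarantee.
    """
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes-per-GB
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for size in (8, 70, 405):
        print(f"{size}B in fp16: ~{estimate_weight_vram_gb(size):.0f} GB")
```

In practice you would leave headroom on top of this number, since the serving runtime needs memory for more than the weights.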

More parameters generally means a more capable model, and almost always a more expensive one to run.


Tokens

What it is: The unit of text that a model reads and generates. Not words — fragments.

Quick visual (an illustrative split; the exact pieces depend on the tokenizer):

Input text:  "Learning AI is fun!"
             ↓ Tokenizer
Tokens:      ["Learn"] ["ing"] [" AI"] [" is"] [" fun"] ["!"]
Token count: 6 tokens


Why it matters to you:

  • API cost is billed per token (input + output separately).
  • Context window is measured in tokens — the model can only "see" so much at once.
  • Speed (TPS, covered below) is measured in tokens per second.

# Quick check: how many tokens is your prompt?
# Uses tiktoken, OpenAI's tokenizer library. Other model families ship
# their own tokenizers, so counts for the same text will differ.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Learning AI is fun!"
tokens = enc.encode(text)

print(f"Token count: {len(tokens)}")        # a handful of tokens for this sentence
print(f"Tokens: {tokens}")                  # a list of integer token IDs
print([enc.decode([t]) for t in tokens])    # the text piece behind each ID


Quick cheat sheet:

  • 1 token ≈ 0.75 English words
  • 1,000 tokens ≈ 750 words ≈ ~1.5 pages
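  • Non-English text (Hindi, Mandarin, Arabic) often uses noticeably more tokens for the same content: 30–70% more is a common ballpark, but the overhead depends on the language and the tokenizer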
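When you don't have a tokenizer handy, the 0.75 words-per-token rule is enough for a first budget estimate. A rough sketch, assuming English text; `estimate_tokens` is a made-up helper name, and a real tokenizer will give different counts:

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for English text, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count of English text from its word count."""
    words = len(text.split())
    # Round up so budgets err on the safe side
    return math.ceil(words / WORDS_PER_TOKEN)


if __name__ == "__main__":
    page = " ".join(["word"] * 750)   # stand-in for ~750 words of prose
    print(estimate_tokens(page))      # about 1,000 tokens by this rule
```

Use it for capacity planning and cost guesses only. For billing-accurate numbers, count with the tokenizer of the model you actually call.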

⚡ Category 2: Hardware Power — FLOPS vs. TOPS

This is where a lot of junior engineers get confused. FLOPS and TOPS sound similar. They are not the same thing.


FLOPS (Floating Point Operations Per Second)

What it is: A measure of raw compute power for floating point arithmetic — the kind of math needed for training and running neural networks.

The scale:

Unit      Value          Context
──────────────────────────────────────────────────────
GFLOPS    10⁹ FLOPS      Your laptop GPU
TFLOPS    10¹² FLOPS     Cloud GPUs (A100: ~312 TFLOPS)
PFLOPS    10¹⁵ FLOPS     Entire GPU clusters

Used for: Server-scale training and inference. When someone says "the H100 delivers 989 TFLOPS of FP16 performance", this is what they mean.

Common GPUs you'll actually use:

GPU          FP16 TFLOPS    Best For
────────────────────────────────────────────────
RTX 4090     ~165           Local dev / fine-tuning
A100 40GB    ~312           Production inference
H100 SXM     ~989           Large-scale training

TOPS (Tera Operations Per Second)

What it is: Similar idea, but used for integer or mixed-precision operations on edge hardware and NPUs (Neural Processing Units).

The key difference:

FLOPS  →  Floating point math  →  GPUs / server chips  →  Training & inference at scale
TOPS   →  Integer / INT8 math  →  NPUs / edge chips    →  On-device inference


Real-world examples:

Device                         TOPS          Use Case
──────────────────────────────────────────────────────────────────
Apple M4 Neural Engine         ~38 TOPS      On-device ML on MacBook
Qualcomm Snapdragon X Elite    ~45 TOPS      AI PCs / laptops
NVIDIA Jetson Orin             ~275 TOPS     Edge AI / robotics
Google TPU v5e                 ~393 TOPS     Cloud inference at scale

When do you care about TOPS? When you're deploying a model to a phone, a laptop, or an embedded device — not a data centre. If you're picking a chip for on-device inference, TOPS is your number.


🏋️ Category 3: Training Cost — FLOPs (Cumulative)

Yes, confusingly, FLOPs (with a lowercase "s": a plural, not "per second") is a different metric from FLOPS.

What it is: The total number of floating point operations performed during an entire training run. It's a measure of compute budget, not hardware speed.

The unit: Usually expressed as:

  • PetaFLOPs (10¹⁵ operations)
  • Or PetaFLOP/s-days: one petaFLOP/s-day is 10¹⁵ operations per second sustained for a full day, about 8.64 × 10¹⁹ operations
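Converting between the two units is plain arithmetic. A small sketch (the function name is illustrative), using the ~3.14 × 10²³ FLOPs figure this article quotes for GPT-3:

```python
SECONDS_PER_DAY = 86_400
PETA = 1e15


def flops_to_petaflop_s_days(total_flops: float) -> float:
    """Convert a cumulative FLOPs count into petaFLOP/s-days.

    One petaFLOP/s-day = 1e15 operations/second sustained for 86,400 seconds.
    """
    return total_flops / (PETA * SECONDS_PER_DAY)


if __name__ == "__main__":
    gpt3_flops = 3.14e23
    print(f"{flops_to_petaflop_s_days(gpt3_flops):,.0f} petaFLOP/s-days")
```

That works out to roughly 3,600 petaFLOP/s-days, which is the scale usually quoted for GPT-3.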

Real-world examples:

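Model           Estimated Training FLOPs
──────────────────────────────────────────────────────────────────────────
GPT-3 (175B)    ~3.14 × 10²³
LLaMA 2 70B     ~8 × 10²³ (estimated as 6 × 70B parameters × 2T training tokens)
Gemini Ultra    ~5 × 10²⁴ (estimated)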

Why it matters to you: Directly as a junior engineer, probably not yet. But understanding it helps you reason about:

  • Why training a model from scratch is prohibitively expensive
  • Why fine-tuning (starting from a pre-trained model) is so much cheaper
  • Why companies like Anthropic and OpenAI have massive infrastructure teams

Quick analogy: FLOPS (the hardware rate) is your car's horsepower. FLOPs (training cost) is the total miles driven on a road trip. One is speed, one is distance.
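In the analogy's terms, time equals distance divided by speed: training time is total FLOPs divided by the FLOPS you can actually sustain. A back-of-envelope sketch; the 1,024-GPU cluster and the 40% utilization figure are assumptions picked for illustration, not how any real model was trained:

```python
SECONDS_PER_DAY = 86_400


def training_days(total_flops: float, peak_tflops_per_gpu: float,
                  n_gpus: int, utilization: float = 0.4) -> float:
    """Estimate wall-clock training days from a compute budget.

    utilization is the fraction of peak FLOPS actually sustained;
    0.4 is an assumed placeholder, real runs vary widely.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    sustained_flops = peak_tflops_per_gpu * 1e12 * n_gpus * utilization
    return total_flops / sustained_flops / SECONDS_PER_DAY


if __name__ == "__main__":
    # GPT-3-scale budget on a hypothetical cluster of 1,024 A100s (~312 TFLOPS each)
    days = training_days(3.14e23, peak_tflops_per_gpu=312, n_gpus=1024)
    print(f"~{days:.0f} days")
```

Even with a thousand data-centre GPUs, the estimate lands at several weeks, which is the intuition behind "training from scratch is prohibitively expensive".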


🚀 Category 4: Speed & Latency — TTFT, TPS, TPM

These three are the metrics you'll track the most in production. They live in your dashboards, your SLAs, and your post-mortems.


TTFT — Time To First Token

What it is: How long (in milliseconds) from sending your request to receiving the first token of the response.

Why it matters: This is what determines if your app feels fast. Even if the full response takes 10 seconds, a 200ms TTFT makes the experience feel responsive. It's the AI equivalent of "First Contentful Paint" in web dev.

User sends prompt
        ↓
  [ ... processing ... ]   ← this duration is TTFT
        ↓
First token arrives → streaming begins → user sees output


Good TTFT benchmarks:

Scenario                          Target TTFT
──────────────────────────────────────────────────────────
Real-time chat                    < 300ms
Interactive coding assistant      < 500ms
Background document processing    < 2,000ms (acceptable)

TPS — Tokens Per Second

What it is: How many tokens the model generates per second during the response. Also called generation speed or throughput.

Why it matters: TPS determines whether your streaming response feels smooth or painfully slow.

  • A human reads at roughly 3–5 tokens per second comfortably.
  • Models generating at < 10 TPS feel sluggish.
  • Modern API servers target 50–150+ TPS for good UX.
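Put TTFT and TPS together and you get a quick estimate of how long a full response takes to stream. A minimal sketch (the helper name is illustrative, and it assumes a steady generation rate after the first token):

```python
def response_time_s(output_tokens: int, ttft_s: float, tps: float) -> float:
    """Approximate end-to-end time for a streamed response.

    Total time is roughly the wait for the first token plus the time
    to generate the rest at a steady tokens-per-second rate.
    """
    if tps <= 0:
        raise ValueError("tps must be positive")
    return ttft_s + output_tokens / tps


if __name__ == "__main__":
    # A 500-token answer at 50 TPS with a 200 ms TTFT
    print(f"{response_time_s(500, ttft_s=0.2, tps=50):.1f} s")
```

The user sees text after 0.2 seconds even though the whole answer takes about ten, which is why TTFT dominates how fast the app feels.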

What affects TPS:

  • Model size (bigger = slower per request)
  • Hardware (H100 >> A100 >> consumer GPU)
  • Batch size (serving multiple requests simultaneously reduces per-request TPS while raising total throughput)
  • Quantization (INT4/INT8 models run faster, with a small accuracy tradeoff)
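To see how these factors play out in your own setup, measure both numbers directly. A sketch that times any iterable of tokens; `fake_stream` is a stand-in for whatever streaming client you use, so the timings here are simulated rather than real model numbers:

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, tokens_per_second, token_count) for a token stream."""
    start = time.perf_counter()
    first_at = None
    last_at = start
    count = 0
    for _ in tokens:
        last_at = time.perf_counter()
        if first_at is None:
            first_at = last_at          # first token arrived: this fixes TTFT
        count += 1
    if first_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_at - start
    decode_time = last_at - first_at    # time spent after the first token
    # TPS over the generation phase only, so TTFT does not skew it
    tps = (count - 1) / decode_time if decode_time > 0 else 0.0
    return ttft, tps, count


def fake_stream(n_tokens: int = 20, first_delay: float = 0.05,
                gap: float = 0.005) -> Iterator[str]:
    """Simulated stream: a pause before the first token, then a steady gap."""
    time.sleep(first_delay)
    for i in range(n_tokens):
        if i:
            time.sleep(gap)
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, tps, n = measure_stream(fake_stream())
    print(f"TTFT: {ttft * 1000:.0f} ms, TPS: {tps:.0f}, tokens: {n}")
```

Swap `fake_stream()` for the token iterator your client returns and log the three values per request.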

TPM — Tokens Per Minute

What it is: Your rate limit from the API provider. The maximum number of tokens your account can process per minute.

Why it matters: Hit your TPM limit and your requests start getting throttled or rejected with 429 Too Many Requests. This is a very common production issue for junior engineers on their first real deployment.
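The usual first line of defence against 429s is retrying with exponential backoff. A minimal sketch: `RateLimitError` is a hypothetical stand-in for whatever exception your client library raises on HTTP 429, and the delay values are assumptions to tune for your provider:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""


def call_with_backoff(fn: Callable[[], T], max_retries: int = 5,
                      base_delay: float = 1.0, max_delay: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter so parallel workers don't retry in lockstep
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky() -> str:
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise RateLimitError()
        return "ok"

    waits: list = []
    print(call_with_backoff(flaky, sleep=waits.append), "after", len(waits), "retries")
```

The `sleep` parameter is injectable so the demo (and your tests) can record delays instead of actually waiting.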

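# A common mistake: not accounting for TPM in batch jobs.
# load_10000_prompts, call_llm_api and process are placeholders for your own code.
import time

prompts = load_10000_prompts()   # Each ~500 tokens

for prompt in prompts:
    response = call_llm_api(prompt)   # 🚨 You'll hit the TPM limit fast
    process(response)

# Better approach: a simple client-side rate limiter
TPM_LIMIT = 40000   # tokens per minute (check your plan)
tokens_this_minute = 0
minute_start = time.time()

for prompt in prompts:
    # Rough estimate: ~1.3 tokens per English word. Input only; if your
    # provider also counts output tokens, add your expected response length.
    estimated_tokens = len(prompt.split()) * 1.3

    # Start a fresh window once a full minute has passed
    if time.time() - minute_start >= 60:
        tokens_this_minute = 0
        minute_start = time.time()

    # Out of budget for this window: wait for it to end, then reset
    if tokens_this_minute + estimated_tokens > TPM_LIMIT:
        sleep_time = 60 - (time.time() - minute_start)
        if sleep_time > 0:
            time.sleep(sleep_time)
        tokens_this_minute = 0
        minute_start = time.time()

    response = call_llm_api(prompt)
    tokens_this_minute += estimated_tokens
    process(response)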

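It also helps to know up front how long a batch job has to take under a given limit. A quick sketch using the numbers from the example above (10,000 prompts of ~500 tokens, 40,000 TPM); the function name is illustrative, and it counts input tokens only:

```python
import math


def min_batch_minutes(n_prompts: int, tokens_per_prompt: int, tpm_limit: int) -> int:
    """Lower bound on batch duration imposed by a tokens-per-minute limit.

    Counts input tokens only; if output tokens also count against your
    limit, include them in tokens_per_prompt.
    """
    total_tokens = n_prompts * tokens_per_prompt
    return math.ceil(total_tokens / tpm_limit)


if __name__ == "__main__":
    minutes = min_batch_minutes(10_000, 500, 40_000)
    print(f"At least {minutes} minutes ({minutes / 60:.1f} hours)")
```

If that floor is longer than your deadline, no amount of clever code helps: you need a higher limit, fewer tokens, or a batch endpoint.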


🔧 Senior Engineer's Note: How It All Connects

Let me show you a real decision you'll face: "Should we use an 8B or 70B model?"

Here's how the metrics interact:

                    8B Model          70B Model
─────────────────────────────────────────────────
Parameters          8 billion         70 billion
VRAM Required       ~16 GB            ~140 GB
GPU Setup           1× A100 40GB      4× A100 40GB
Est. TPS            ~80–120 TPS       ~15–30 TPS
TTFT (A100)         ~150ms            ~400ms
API Cost (est.)     ~$0.15/M tokens   ~$0.90/M tokens
Quality             Good              Excellent
─────────────────────────────────────────────────

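The "GPU Setup" row follows directly from the VRAM row. A rough sketch of that calculation, assuming fp16 weights and ignoring the cost of splitting a model across cards; `overhead_fraction` is an assumed knob for KV cache and runtime headroom, left at zero to match the table:

```python
import math


def gpus_needed(params_billions: float, gpu_vram_gb: float,
                bytes_per_param: float = 2.0,
                overhead_fraction: float = 0.0) -> int:
    """Minimum GPU count whose combined VRAM can hold the model weights."""
    weights_gb = params_billions * bytes_per_param      # 2 bytes/param in fp16
    required_gb = weights_gb * (1 + overhead_fraction)  # optional headroom
    return math.ceil(required_gb / gpu_vram_gb)


if __name__ == "__main__":
    for size in (8, 70):
        print(f"{size}B on A100 40GB: {gpus_needed(size, 40)} GPU(s)")
```

With these defaults the sketch gives 1 GPU for 8B and 4 GPUs for 70B, the same setups as in the comparison.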

The real-world math: Say your app handles 1,000 users/day, each generating ~2,000 tokens per session.

Daily tokens = 1,000 users × 2,000 tokens = 2,000,000 tokens

8B model cost:  2M tokens × $0.15 per 1M tokens = $0.30/day  → $9/month
70B model cost: 2M tokens × $0.90 per 1M tokens = $1.80/day  → $54/month


That's a 6× cost difference. For a startup, that matters.
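The same math as a reusable helper, so you can plug in your own traffic and pricing. A small sketch; the prices are the estimates from the comparison above, and a 30-day month is assumed:

```python
def monthly_cost_usd(users_per_day: int, tokens_per_user: int,
                     price_per_million_usd: float, days: int = 30) -> float:
    """Estimate monthly token spend from daily traffic and a per-1M-token price."""
    daily_tokens = users_per_day * tokens_per_user
    daily_cost = daily_tokens / 1_000_000 * price_per_million_usd
    return daily_cost * days


if __name__ == "__main__":
    small = monthly_cost_usd(1_000, 2_000, price_per_million_usd=0.15)
    large = monthly_cost_usd(1_000, 2_000, price_per_million_usd=0.90)
    print(f"8B: ${small:.2f}/month, 70B: ${large:.2f}/month, ratio: {large / small:.0f}x")
```

Rerun it whenever traffic or pricing changes. The ratio stays fixed, but the absolute gap grows linearly with usage.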

The senior engineer's question isn't "which model is better?" It's "which model is good enough for this use case at this scale?"

Start with the smaller model. Benchmark it against your quality requirements. Scale up only if you have to.


Quick Reference Cheat Sheet

Metric        Full Name                     Measures                      Typical Unit
──────────────────────────────────────────────────────────────────────────────────────
Parameters    -                             Model size / capacity         M, B, T
Tokens        -                             Text unit for I/O and cost    count
FLOPS         Floating Point Ops/sec        Hardware speed (server)       TFLOPS
TOPS          Tera Operations/sec           Hardware speed (edge/NPU)     TOPS
FLOPs         Floating Point Ops (total)    Training compute cost         PetaFLOPs
TTFT          Time To First Token           Latency / responsiveness      milliseconds
TPS           Tokens Per Second             Generation speed              tokens/sec
TPM           Tokens Per Minute             API rate limit                tokens/min

Where to Go Next

You now have the vocabulary. Here's how to build on it:

  • Experiment with tokenizers → platform.openai.com/tokenizer
  • Benchmark models on your hardware → try llama.cpp or Ollama locally
  • Track TTFT and TPS in your own apps → add timing logs around your API calls from day one
  • Read model cards → every major model release includes parameter count, training FLOPs, and benchmark scores. They're not marketing fluff — they're specs.

The engineers who understand these numbers don't just write code. They make better architectural decisions, avoid expensive surprises, and earn trust faster.

That's the real reason to care.


Got questions? Drop them in the comments.