LLM inference optimization: techniques that actually reduce latency and cost
Josh Siegel · 2026-03-19 · via Runpod Blog.

Your GPU bill is doubling every quarter, but your throughput metrics haven’t moved. That’s the dirty secret of naive AI serving: raw compute spend doesn’t correlate with actual performance delivered to users. A standard Hugging Face pipeline() call keeps your A100 significantly underutilized under real traffic patterns, because it processes requests one at a time while everything else waits. You’re paying for idle silicon.

The fix isn’t buying bigger GPUs. It’s switching from naive serving to optimized serving, which means deploying the same model differently. High-performance teams running Llama-3-70B in production have converged on a specific stack: vLLM or SGLang as the inference engine, Prometheus for observability, and Runpod as the infrastructure layer that lets them deploy and iterate without managing a Kubernetes cluster. This guide works through the stack in ROI order: quantization (VRAM footprint), serving engine selection (throughput), speculative decoding (latency), and deployment mode (cost-scaling).

The bottlenecks are compute and memory, not just model size

LLM inference has two fundamentally different phases, and they have different performance characteristics.

Prefill is the compute-bound phase. The model processes your entire input prompt in a single forward pass. Prefill determines your Time to First Token (TTFT). On a dense 70B model, a 4,000-token prompt might take 400ms to prefill across a tensor-parallel A100 setup. You can’t parallelize this across requests in the same way, so the only real lever is raw compute.

Decode is the memory-bound phase. The model generates one token at a time, and each step requires reading the model weights and the sequence’s entire KV cache from GPU VRAM. VRAM bandwidth, not FLOPs, almost entirely determines inter-token latency (how fast tokens stream out). An H100 SXM5 has 3.35 TB/s of memory bandwidth versus an A6000’s 768 GB/s, which explains much of the latency delta between them on long-form generation.
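To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes a single GPU, a hypothetical ~35GB of 4-bit weights read once per decode step, and it ignores KV cache reads, kernel overhead, and tensor parallelism, so treat the numbers as a floor rather than a prediction.

```python
GB = 1e9
TB = 1e12


def decode_latency_floor_ms(bytes_read_per_token: float,
                            bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency if each step is limited by memory reads."""
    return bytes_read_per_token / bandwidth_bytes_per_s * 1e3


# Assumption for illustration: ~70B parameters stored at 4 bits each.
weights_4bit = 35 * GB

h100_ms = decode_latency_floor_ms(weights_4bit, 3.35 * TB)   # H100 SXM5
a6000_ms = decode_latency_floor_ms(weights_4bit, 768 * GB)   # RTX A6000

print(f"H100 floor:  {h100_ms:.1f} ms/token")
print(f"A6000 floor: {a6000_ms:.1f} ms/token")
print(f"Ratio:       {a6000_ms / h100_ms:.2f}x")
```

The ratio is just the bandwidth ratio, which is the point: during decode, a faster memory bus buys you more than extra FLOPs do.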

The KV cache is the core pressure point. For every token in a sequence, attention layers store key and value tensors. The memory footprint follows the formula: num_layers × 2 × num_kv_heads × head_dim × seq_len × dtype_bytes. For Llama-3-70B (80 layers, GQA with 8 KV heads, head_dim=128) at BF16 (2 bytes): 80 × 2 × 8 × 128 × 4,096 × 2 ≈ 1.3 GB per request at a 4,096-token context. That number scales linearly with sequence length, which is why long-context workloads saturate VRAM before FLOPs become the bottleneck.
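The formula is easy to turn into a helper for capacity planning. This is a minimal sketch; the function names and the 20GB of free VRAM in the example are illustrative assumptions, not values from any library.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """KV cache size for one sequence: keys and values (the factor of 2) per layer."""
    return num_layers * 2 * num_kv_heads * head_dim * seq_len * dtype_bytes


def max_concurrent_requests(free_vram_bytes: int, per_request_bytes: int) -> int:
    """How many full-length sequences fit in the VRAM left after loading weights."""
    return free_vram_bytes // per_request_bytes


# Llama-3-70B figures from the text: 80 layers, 8 KV heads, head_dim 128, BF16.
per_request = kv_cache_bytes(80, 8, 128, seq_len=4096)
print(f"{per_request / 1e9:.2f} GB per 4,096-token request")

# Hypothetical: 20 GB of VRAM left over for KV cache.
print(max_concurrent_requests(20 * 10**9, per_request), "requests fit")
```

Doubling seq_len doubles the per-request footprint, so the number of sequences you can hold concurrently halves.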

Prometheus is the right tool to see this in real time. vLLM exposes vllm:gpu_cache_usage_perc and vllm:num_requests_waiting on its Prometheus-compatible /metrics endpoint. Wire these up to Grafana and you’ll immediately see when you’re cache-bound versus compute-bound, which tells you exactly which optimization to reach for.

[Figure: flowchart of LLM inference showing prefill and decode phases with KV cache writes and reads]

These two metrics tell you which constraint to address first. For most teams serving 70B-class models under concurrent load, VRAM pressure arrives before compute does.
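If you want a quick check before building a Grafana dashboard, you can read the two gauges straight from the /metrics text. The sketch below parses a hand-written sample in the Prometheus text format. The sample values, the 0.9 threshold, and the diagnose() labels are assumptions for illustration, and the parser skips details such as optional timestamps.

```python
# Illustrative sample only; real output contains many more series and labels.
SAMPLE = """\
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.97
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 12.0
"""


def parse_gauges(text: str) -> dict:
    """Map metric name -> value, ignoring comments and label sets."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(value)
    return values


def diagnose(metrics: dict, cache_threshold: float = 0.9) -> str:
    """Rough heuristic: a queue plus a full KV cache points at VRAM pressure."""
    cache = metrics.get("vllm:gpu_cache_usage_perc", 0.0)
    waiting = metrics.get("vllm:num_requests_waiting", 0.0)
    if waiting > 0 and cache >= cache_threshold:
        return "cache-bound"
    if waiting > 0:
        return "compute-bound"
    return "healthy"


print(diagnose(parse_gauges(SAMPLE)))
```

In practice you would fetch the text from your server’s /metrics URL and tune the threshold against your own traffic.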

Quantization strategy: fit more model into less VRAM

The single biggest optimization for most teams is quantization, specifically switching from BF16 to a 4-bit format. Here’s why it matters at the unit economics level: a Llama-3-70B model in BF16 occupies ~140GB of VRAM, which requires at minimum two H100 80GB GPUs at roughly $2.69/hr each on Runpod (about $5.38/hr). The same model in 4-bit AWQ needs roughly 35GB for weights and fits comfortably on dual RTX A6000s (96GB total), which run at approximately $0.49/hr per GPU on Runpod (about $0.98/hr). That’s a cost reduction of over 80% with minimal quality loss.
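The arithmetic behind those figures fits in a few lines. This sketch uses the prices quoted above and a simple params × bits estimate for weights; it ignores the small overhead of quantization scales and zero points, and the helper names are made up for illustration.

```python
def weight_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (excludes scales/zero-points overhead)."""
    return num_params * bits_per_weight / 8 / 1e9


def hourly_cost(gpu_count: int, price_per_gpu_hr: float) -> float:
    return gpu_count * price_per_gpu_hr


bf16_gb = weight_gb(70e9, 16)   # ~140 GB
awq_gb = weight_gb(70e9, 4)     # ~35 GB

h100_pair = hourly_cost(2, 2.69)    # two H100 80GB
a6000_pair = hourly_cost(2, 0.49)   # two RTX A6000
saving = 1 - a6000_pair / h100_pair

print(f"BF16 weights: {bf16_gb:.0f} GB, 4-bit weights: {awq_gb:.0f} GB")
print(f"${h100_pair:.2f}/hr vs ${a6000_pair:.2f}/hr -> {saving:.0%} cheaper")
```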

AWQ (Activation-Aware Weight Quantization) is the current standard for Llama-class models. Unlike naive round-to-nearest quantization, AWQ preserves the 1% of weights that have the most impact on activation outputs, which is why the perplexity delta between a well-quantized AWQ model and its BF16 source is often below 0.5 points on standard benchmarks.

You don’t need to quantize the model yourself. The TechxGenus collection on Hugging Face includes production-ready AWQ versions of Llama-3-70B. To deploy it on a Runpod Pod, you pull the vLLM Docker image and set your environment:
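The original snippet isn’t reproduced here, so what follows is a minimal sketch rather than the exact configuration. It assembles the arguments you might pass as the container start command for the vllm/vllm-openai image. The model repo name, context length, and memory fraction are assumptions to adjust for your own setup.

```python
import shlex

# Assumed repo name for the pre-quantized checkpoint; verify it on Hugging Face.
MODEL_ID = "TechxGenus/Meta-Llama-3-70B-Instruct-AWQ"


def vllm_server_args(model: str,
                     quantization: str = "awq",
                     tensor_parallel_size: int = 2,
                     max_model_len: int = 4096,
                     gpu_memory_utilization: float = 0.90) -> list:
    """Arguments for vLLM's OpenAI-compatible server."""
    return [
        "--model", model,
        "--quantization", quantization,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


args = vllm_server_args(MODEL_ID)
container_command = shlex.join(args)
print(container_command)
```

Paste the printed string into the Pod’s container start command, or pass the list to whatever tooling you use to create the Pod.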

H100s support native FP8 tensor cores, so if you have access to them, FP8 quantization is worth evaluating. FP8 inference runs without emulation overhead, and vLLM enables it with --quantization fp8. VRAM usage drops by ~50% versus BF16, which brings a 70B model’s weights down to roughly 70GB: enough to fit on a single H100 SXM 80GB, though the remaining space for KV cache is tight, so keep an eye on context length and concurrency. The throughput improvement over BF16 is up to 1.6x on generation-heavy workloads.

To quantize a custom fine-tuned checkpoint, AutoAWQ handles this in Python in under 30 minutes on an A10G:
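The original snippet isn’t reproduced here; the sketch below follows the usage pattern from the AutoAWQ README. The paths are hypothetical placeholders, and the 4-bit, group-size-128 GEMM settings are a common default rather than a recommendation specific to your checkpoint.

```python
# Common 4-bit AWQ settings; adjust if your deployment needs something else.
QUANT_CONFIG = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}


def quantize_checkpoint(source_path: str, output_path: str,
                        quant_config: dict = QUANT_CONFIG) -> None:
    """Quantize a fine-tuned checkpoint with AutoAWQ and save it for serving."""
    # Imported lazily so this file can be loaded on machines without the GPU stack.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(source_path)
    tokenizer = AutoTokenizer.from_pretrained(source_path)

    # Runs calibration and rewrites the weights to 4-bit.
    model.quantize(tokenizer, quant_config=quant_config)

    model.save_quantized(output_path)
    tokenizer.save_pretrained(output_path)


if __name__ == "__main__":
    # Hypothetical paths for illustration.
    quantize_checkpoint("./my-finetuned-model", "./my-finetuned-model-awq")
```

The saved directory can then be passed to vLLM as the model path with AWQ quantization enabled.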

With your model’s VRAM footprint reduced, the next constraint is how efficiently your serving engine keeps the GPU saturated under real traffic.

Throughput and structured generation with vLLM and SGLang

Continuous Batching, introduced in Orca (2022) and implemented in vLLM, is what makes modern serving engines work. Traditional static batching waits for a full batch of requests to complete before starting new ones. Continuous batching inserts new requests into the decode loop as soon as a slot opens up, keeping GPU utilization well above what you see with sequential processing; real-world figures run 60-85% under steady traffic versus the low utilization of naive serving.
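A toy simulation shows why this matters. It counts decode steps for a queue of requests with different output lengths under both policies. This is a simplified model (one token per slot per step, no prefill cost), not how any real scheduler is implemented.

```python
import heapq


def static_batching_steps(lengths: list, batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths: list, slots: int) -> int:
    """A queued request takes over a slot the moment the previous one finishes."""
    finish_times = []  # min-heap of step counts at which each slot frees up
    for n in lengths:
        start = 0 if len(finish_times) < slots else heapq.heappop(finish_times)
        heapq.heappush(finish_times, start + n)
    return max(finish_times, default=0)


def utilization(lengths: list, steps: int, slots: int) -> float:
    """Fraction of slot-steps that produced a token."""
    return sum(lengths) / (steps * slots)


lengths = [8, 2, 2, 2]  # output tokens per request, in arrival order
static = static_batching_steps(lengths, batch_size=2)
continuous = continuous_batching_steps(lengths, slots=2)

print(static, utilization(lengths, static, 2))          # 10 steps, 0.7
print(continuous, utilization(lengths, continuous, 2))  # 8 steps, 0.875
```

The short requests no longer wait behind the long one, so the same work finishes in fewer steps.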

vLLM also implements PagedAttention, which treats VRAM like virtual memory for KV cache, eliminating the need to pre-allocate contiguous blocks. PagedAttention allows more sequences to coexist in memory simultaneously, directly improving throughput on concurrent workloads.
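Here is a small sketch of the memory accounting. It compares reserving a maximum-length region per sequence with allocating fixed-size blocks on demand. The sequence lengths are made up, and the block size of 16 tokens is an assumption for illustration.

```python
import math


def preallocated_slots(seq_lens: list, max_len: int) -> int:
    """Token slots reserved when every sequence gets a contiguous max_len region."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens: list, block_size: int = 16) -> int:
    """Token slots reserved when each sequence only holds the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [100, 900, 37, 2048]  # current lengths of four in-flight sequences
used = sum(seq_lens)

pre = preallocated_slots(seq_lens, max_len=4096)
paged = paged_slots(seq_lens)

print(f"pre-allocated: {pre} slots, {used / pre:.0%} used")
print(f"paged:         {paged} slots, {used / paged:.0%} used")
```

With paging, the only waste is the partially filled last block of each sequence, which is what frees room for more concurrent sequences.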

For agentic workflows, multi-step chains, and structured JSON output, SGLang frequently outperforms standard vLLM. The reason is SGLang’s RadixAttention mechanism, which automatically reuses the KV cache for shared prompt prefixes across requests. In an agentic workflow where every request starts with the same system prompt and tool definitions (often 1,000+ tokens), RadixAttention means that prefix is computed once and cached, not recomputed per request. At scale, RadixAttention can deliver significantly lower effective TTFT on agent-style workloads compared to recomputing the prefix on every request.
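To see the effect of prefix reuse, the sketch below counts how many tokens need prefill with and without a prefix cache. It uses a linear scan over earlier requests instead of a radix tree, so it illustrates the savings, not SGLang’s actual data structure. The token IDs are arbitrary.

```python
def shared_prefix_len(a: list, b: list) -> int:
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefill_tokens(requests: list, reuse_prefix: bool) -> int:
    """Total tokens that must be run through prefill across all requests."""
    seen = []
    total = 0
    for tokens in requests:
        hit = 0
        if reuse_prefix:
            hit = max((shared_prefix_len(tokens, s) for s in seen), default=0)
        total += len(tokens) - hit
        seen.append(tokens)
    return total


system_prompt = list(range(1000))  # 1,000 shared tokens (system prompt + tools)
requests = [
    system_prompt + [10_000 + i * 100 + j for j in range(50)]  # 50 unique tokens
    for i in range(3)
]

print(prefill_tokens(requests, reuse_prefix=False))  # 3150
print(prefill_tokens(requests, reuse_prefix=True))   # 1150
```

The first request pays for the full prompt; every later request only pays for its own 50 tokens.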

The LMSYS benchmark data supports this: SGLang consistently delivers higher throughput on structured generation tasks compared to equivalent vLLM configurations, specifically because of this shared prefix optimization.

[Figure: decision flowchart matching LLM workload types to inference engines like vLLM and SGLang]

Whether you’re using vLLM or SGLang, these flags matter when you deploy via a Runpod Pod or template. For vLLM: --max-num-seqs controls the maximum number of sequences in the batch. The right value depends on your average context length and available VRAM. Set it too high and you’ll OOM; too low and you leave throughput on the table. A starting point for dual A6000s with a quantized 70B is --max-num-seqs 64. Add --disable-log-stats in production to eliminate the logging overhead that adds a few milliseconds per batch on high-QPS endpoints.

For SGLang: --tp 2 sets tensor parallelism across two GPUs. --chunked-prefill-size 512 controls chunked prefill, which prevents long prompts from monopolizing the GPU and improves latency fairness across concurrent requests. Start with 512 for mixed-length workloads; increase to 1024 if your traffic is predominantly short prompts, or drop to 256 if you’re seeing latency spikes from long system prompts under concurrent load.
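The starting points above can be captured in a small helper so they live in code instead of a wiki page. This is a sketch: the traffic labels and function names are hypothetical, and the values are just the defaults suggested in the text.

```python
# Starting points from the text, keyed by a hypothetical traffic label.
CHUNKED_PREFILL_SIZES = {
    "mixed": 512,
    "mostly_short": 1024,
    "long_system_prompts": 256,
}


def sglang_launch_args(model_path: str, traffic: str = "mixed", tp: int = 2) -> list:
    """Command line for launching an SGLang server with chunked prefill."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp),
        "--chunked-prefill-size", str(CHUNKED_PREFILL_SIZES[traffic]),
    ]


def vllm_tuning_args(max_num_seqs: int = 64, disable_log_stats: bool = True) -> list:
    """Batching and logging flags to append to a vLLM server command."""
    args = ["--max-num-seqs", str(max_num_seqs)]
    if disable_log_stats:
        args.append("--disable-log-stats")
    return args


print(" ".join(sglang_launch_args("/models/my-model", traffic="long_system_prompts")))
print(" ".join(vllm_tuning_args()))
```

Treat the numbers as a first guess and adjust them against your own latency and cache-usage metrics.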

These settings handle concurrent throughput. For long-form generation, there’s a separate latency technique worth adding.

Speculative decoding: cut latency without changing hardware

If your workload skews toward long-form generation (coding assistants, document summarization, report generation), speculative decoding is one of the biggest latency reductions you can get without changing hardware.

The mechanism: a small “draft” model (typically 1-7B parameters) generates 3-12 candidate tokens per step. The large target model verifies all candidates in a single parallel forward pass. When the draft model guesses correctly (which, with a well-matched draft model on domain-specific tasks, can happen at rates as high as 70-90%), you get multiple tokens for roughly the cost of 1 target model step. Research on speculative decoding shows 2-3x speedups on generation-heavy tasks.
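The verification step is simpler than it sounds. Below is a toy sketch of the greedy variant: accept draft tokens while they match what the target model would have produced, then take the target’s own token at the first mismatch (or one bonus token if everything matched). Real implementations work on logits and usually use a sampling-based acceptance rule; the token lists here are stand-ins.

```python
def verify_draft(draft_tokens: list, target_tokens: list) -> list:
    """Return the tokens kept after one target-model verification pass.

    target_tokens[i] is the target model's choice at position i given the
    draft tokens before it, so it has len(draft_tokens) + 1 entries.
    """
    accepted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            accepted.append(wanted)  # target's correction ends the step
            return accepted
        accepted.append(drafted)
    accepted.append(target_tokens[len(draft_tokens)])  # bonus token
    return accepted


# Draft guessed 4 tokens; the target disagrees on the third.
print(verify_draft(["the", "cat", "sat", "on"],
                   ["the", "cat", "lay", "on", "a"]))

# Draft guessed everything right: 4 tokens from a single target pass.
print(verify_draft([1, 2, 3], [1, 2, 3, 4]))
```

Either way the output is identical to what the target model would have generated alone; the draft only changes how many target passes it takes.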

The economic case is direct: if you’re paying $3/hr for your inference endpoint and speculative decoding cuts latency by 2x, you either halve your cost per request at the same throughput, or serve twice the requests at the same cost. Neither requires touching your hardware configuration.

Here’s how to deploy a speculative decoding setup using the Runpod SDK:

The draft model should be from the same model family as your target. Llama-3-8B-Instruct-AWQ as a draft model for Llama-3-70B-Instruct-AWQ is the canonical pairing. Mismatched architectures produce low acceptance rates that eliminate the speedup. You can verify the draft model’s effectiveness via vLLM’s vllm:spec_decode_draft_acceptance_length metric in Prometheus. If the acceptance length falls below ~0.5 tokens per step, the draft model is poorly matched and speculative decoding is adding overhead rather than reducing it.
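You can estimate where the break-even sits before deploying anything. The sketch below uses the standard analysis of speculative decoding, which assumes each draft token is accepted independently with probability alpha. Here gamma is the number of draft tokens per step and draft_cost_ratio is the draft model’s step time relative to the target’s; all three values in the example are illustrative assumptions.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model pass."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def speculative_speedup(alpha: float, gamma: int, draft_cost_ratio: float) -> float:
    """Speedup over plain decoding once the draft model's cost is included."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost_ratio + 1)


# Well-matched draft model: most guesses are accepted.
print(f"{speculative_speedup(alpha=0.8, gamma=4, draft_cost_ratio=0.1):.2f}x")

# Poorly matched draft model: the drafting overhead outweighs the gains.
print(f"{speculative_speedup(alpha=0.2, gamma=4, draft_cost_ratio=0.1):.2f}x")
```

A result below 1.0x means the draft model is costing you more than it saves, which is the situation the Prometheus metric is meant to catch.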

Quantization, engine selection, and speculative decoding handle the model side. What remains is deployment: whether your infrastructure costs track with demand or ahead of it.

Serverless vs. pods: architecting for cost

Runpod Serverless scales to zero between requests and spins up workers on demand. Billing is per-second of GPU time, so you pay only while a worker is active; there’s no reserved-capacity cost during idle periods. This is the right choice for spiky, unpredictable traffic, like a chatbot that sees 1,000 concurrent users at 9am and 20 at 3am. The historical objection to serverless LLM hosting was cold start time: loading a large model from a cold state could take a minute or more, making the first request in any cold-start window intolerable. Runpod’s FlashBoot technology significantly reduces this through container-level and image-level optimizations, making cold starts practical for production use.

Runpod Pods are persistent GPU instances billed per-second. Use them when your traffic is sustained, when you’re running fine-tuning jobs with Ray, or when you need consistent latency guarantees for SLA-bound endpoints. A Ray-based distributed fine-tuning job, for example, requires consistent inter-node communication that serverless cold starts would interrupt.
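A quick break-even calculation helps you decide between the two. The prices below are hypothetical placeholders, not Runpod’s rates, and the model assumes serverless GPU-seconds cost more than the equivalent Pod time. It also ignores cold starts and idle timeouts.

```python
def break_even_busy_fraction(pod_price_per_hr: float,
                             serverless_price_per_s: float) -> float:
    """Fraction of the time a worker must be busy before a Pod becomes cheaper."""
    return pod_price_per_hr / (serverless_price_per_s * 3600)


def cheaper_mode(busy_fraction: float, pod_price_per_hr: float,
                 serverless_price_per_s: float) -> str:
    threshold = break_even_busy_fraction(pod_price_per_hr, serverless_price_per_s)
    return "serverless" if busy_fraction < threshold else "pods"


# Hypothetical prices for illustration only.
POD_PRICE_PER_HR = 2.00
SERVERLESS_PRICE_PER_S = 0.0010  # i.e. $3.60 per busy hour

threshold = break_even_busy_fraction(POD_PRICE_PER_HR, SERVERLESS_PRICE_PER_S)
print(f"break-even at {threshold:.0%} busy")
print(cheaper_mode(0.10, POD_PRICE_PER_HR, SERVERLESS_PRICE_PER_S))  # spiky traffic
print(cheaper_mode(0.90, POD_PRICE_PER_HR, SERVERLESS_PRICE_PER_S))  # sustained traffic
```

Plug in current prices for your GPU class and your measured busy fraction to get a number that reflects your workload.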

[Figure: decision flowchart routing spiky traffic to Runpod Serverless and sustained traffic to Pods]

Infrastructure setup time matters too. The gap between Runpod and bare-metal providers like Lambda Labs is large. To reach the equivalent setup on a bare VM, you’d provision the instance, configure the OS and CUDA drivers, install Docker, set up your orchestration layer (Kubernetes or Slurm), deploy your inference container, configure autoscaling rules, and wire up your load balancer. That’s a realistic two-week sprint for an engineer who hasn’t done it before. On Runpod, you select a vLLM template, set your environment variables, and your endpoint is live in minutes. The time you save isn’t just engineering hours: it’s two weeks where you’re shipping product instead of configuring infrastructure.

Lambda Labs has competitive hardware pricing, but the managed serving layer is thin: you still own the orchestration. If your workload needs auto-scaling inference with short-lived, per-request billing, Runpod’s Serverless infrastructure handles that out of the box. CoreWeave targets enterprises with reserved contracts, which is the wrong motion for a seed-stage startup that needs to validate unit economics before committing to reserved capacity.

Platform selection is the last dial, but it’s not a small one: a well-optimized model stack on the wrong infrastructure still produces the wrong billing curve.

Conclusion

The optimization sequence here is ordered by ROI. Start with quantization (AWQ or FP8 depending on your hardware). It’s a one-time change that cuts your VRAM requirements significantly (roughly 75% with 4-bit AWQ, or 50% with FP8) and immediately opens up cheaper GPU classes. Then select the right serving engine: SGLang for agentic and structured-output workloads, vLLM for chat and general inference. Add speculative decoding if long-form generation is in your critical path. Monitor everything with Prometheus so you’re reacting to actual bottlenecks, not assumptions.

Your implementation checklist:

  1. Quantize with AWQ (or FP8 on H100s) using AutoAWQ or a pre-quantized Hugging Face checkpoint
  2. Choose your engine: SGLang for agents and JSON output, vLLM for chat throughput
  3. Enable speculative decoding on generation-heavy endpoints
  4. Wire up Prometheus to vllm:gpu_cache_usage_perc before you go to production
  5. Match your deployment mode to your traffic pattern: Serverless for spiky, Pods for sustained

The difference between a profitable inference endpoint and a money pit is almost never hardware. It’s the software stack running on that hardware, and the time it took to get it into production.

You don’t need to manage a Kubernetes cluster. The Runpod SDK gets your stack from quantized model to live endpoint in minutes.

To dive deeper, check out The LLM inference optimization playbook: architecting for latency, throughput, and cost.
