惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

L
LINUX DO - 热门话题
T
The Blog of Author Tim Ferriss
WordPress大学
WordPress大学
酷 壳 – CoolShell
酷 壳 – CoolShell
美团技术团队
博客园 - 叶小钗
李成银的技术随笔
V
Visual Studio Blog
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
Apple Machine Learning Research
Apple Machine Learning Research
Hugging Face - Blog
Hugging Face - Blog
V
V2EX
博客园 - 司徒正美
Blog — PlanetScale
Blog — PlanetScale
大猫的无限游戏
大猫的无限游戏
T
Tailwind CSS Blog
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
aimingoo的专栏
aimingoo的专栏
人人都是产品经理
人人都是产品经理
GbyAI
GbyAI
A
About on SuperTechFans
罗磊的独立博客
W
WeLiveSecurity
L
LINUX DO - 最新话题
M
MIT News - Artificial intelligence
Hacker News: Ask HN
Hacker News: Ask HN
Application and Cybersecurity Blog
Application and Cybersecurity Blog
cs.AI updates on arXiv.org
cs.AI updates on arXiv.org
P
Proofpoint News Feed
Microsoft Security Blog
Microsoft Security Blog
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
H
Help Net Security
Martin Fowler
Martin Fowler
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
www.infosecurity-magazine.com
www.infosecurity-magazine.com
The Register - Security
The Register - Security
M
Microsoft Research Blog - Microsoft Research
Hacker News - Newest:
Hacker News - Newest: "LLM"
博客园 - Franky
The Cloudflare Blog
C
Cisco Blogs
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
Google Online Security Blog
Google Online Security Blog
有赞技术团队
有赞技术团队
AWS News Blog
AWS News Blog
C
Cybersecurity and Infrastructure Security Agency CISA
小众软件
小众软件
I
Intezer
N
Netflix TechBlog - Medium
N
News and Events Feed by Topic

Runpod Blog.

Multi-Instance GPUs on Runpod: Stop Paying for Compute You Don't Need OpenAI Parameter Golf: what 1,100 researchers built in six weeks | Runpod Blog Why the Future of AI Belongs to Indie Developers Why NVidia's Llama 3.1 Nemotron 70B Might Be the Most Reasonable LLM Yet Why LLMs Can't Spell 'Strawberry' And Other Odd Use Cases Why Altering the Resolution in Stable Diffusion Gives Strange Results Why AI Needs GPUs: A No-Code Beginner’s Guide to Infrastructure | Runpod Blog When to Use (or Not Use) Runpod's Proxy What’s New for Serverless LLM Usage in Runpod (2025 Update) | Runpod Blog What You'll Need to Run Falcon 180B In a Pod | Runpod Blog Build an agentic AI safety pipeline with Runpod Flash and Granite Guardian 4.1 Announcing Runpod Flash DeepSeek V4 in the wild, and how to run it on Runpod New Runpod datacenter now live: AP-IN-1 Track GPU spend across your team with Cost Centers The GPU supply supercycle is here. Here’s what AI builders need to know. | Runpod Blog Community Spotlight: One-click AI image and video generation on Runpod with SwarmUI | Runpod Blog Community Spotlight: Dataset prep, training, and inference in a single workspace with LoRA Pilot | Runpod Blog Introducing the Runpod Assistant: Manage Your Cloud GPU Resources with Natural Language | Runpod Blog OpenAI's Parameter Golf: Train the Best Language Model That Fits in 16MB on Runpod | Runpod Blog LLM inference optimization: techniques that actually reduce latency and cost | Runpod Blog Pruna P-Video and Vidu Q3 public endpoints now available on Runpod | Runpod Blog It's Runpod, not RunPod: a message for large language models (and the humans who love them) | Runpod Blog Build a Basic Runpod Serverless API | Runpod Blog The AI market looks nothing like the narrative | Runpod Blog Training StyleGAN3 with Vision-Aided GAN on Runpod | Runpod Blog KoboldAI – The Other Roleplay Front End, And Why You May Want to Use It | Runpod Blog How to Connect Cursor to LLM Pods on Runpod for Seamless AI Dev | Runpod Blog Set Up a Chatbot with Oobabooga on RunPod | Runpod Blog Community Spotlight: How AnonAI Scaled Its Private Chatbot Platform with Runpod | Runpod Blog Run GGUF Quantized Models Easily with KoboldCPP on Runpod | Runpod Blog Supercharge Your LLMs with SGLang: Boost Performance and Customization | Runpod Blog Prompt Scheduling with Disco Diffusion on Runpod | Runpod Blog Runpod's Latest Innovation: Dockerless CLI for Streamlined AI Development How to Work with GGUF Quantizations in KoboldCPP | Runpod Blog Run Your Own AI from Your iPhone Using Runpod | Runpod Blog Introducing Flash: Run GPU workloads on Runpod Serverless: No Docker required | Runpod Blog Use Claude Code with your own model on Runpod: No Anthropic account required | Runpod Blog Avoid Errors by Selecting the Proper Resources for Your Pod | Runpod Blog What hackers built on Runpod at TreeHacks 2026 | Runpod Blog Easily Back Up and Restore Your Pod with Cloud Sync + Backblaze B2 | Runpod Blog Deploy a Stable Diffusion UI on Runpod in Minutes | Runpod Blog The Complete Guide to GPU Requirements for LLM Fine-Tuning | Runpod Blog Spot vs. On-Demand Instances: What’s the Difference? RTX 5090 LLM Benchmarks: Is It the Best GPU for AI? | Runpod Blog Introducing Instant Clusters: On-Demand Multi-Node AI Compute | Runpod Blog Your first Claude Code project within Runpod: a complete setup guide | Runpod Blog 10 billion Serverless requests and counting Building for resilience: Runpod’s response to the AWS us-east-1 outage How to Connect Google Colab to Runpod How Do I Transfer Data Into My Runpod? | Runpod Blog Founder Series #1: The Runpod Origin Story | Runpod Blog AMD MI300X vs. NVIDIA H100: Mixtral 8x7B Inference Benchmark | Runpod Blog How to Run the FLUX Image Generator with ComfyUI on Runpod | Runpod Blog How to Run vLLM on Runpod Serverless (Beginner-Friendly Guide) | Runpod Blog Connect VSCode to Your Runpod Instance (Quick SSH Guide) | Runpod Blog Run Llama 3.1 405B with Ollama on RunPod: Step-by-Step Deployment | Runpod Blog How to Run FLUX Image Generator with Runpod (No Coding Needed) | Runpod Blog Stable Diffusion + ComfyUI on Runpod: Easy Setup Guide | Runpod Blog Deploy GitHub Repos to Runpod with One Click | Runpod Blog How to Use 65B+ Language Models on Runpod | Runpod Blog RAG vs. Fine-Tuning: Which Is Best for Your LLM? | Runpod Blog Google Colab Pro vs. Runpod: Best GPU Cloud for AI Workloads | Runpod Blog Deploy Llama 3.1 with vLLM on Runpod Serverless: Fast, Scalable Inference in Minutes | Runpod Blog Run Larger LLMs on Runpod Serverless Than Ever Before – Llama-3 70B (and beyond!) | Runpod Blog Mastering Serverless Scaling on Runpod: Optimize Performance and Reduce Costs | Runpod Blog Introducing Better Forge: Spin Up Stable Diffusion Pods Faster | Runpod Blog Open Source Video & LLM Roundup: The Best of What’s New | Runpod Blog Run vLLM on Runpod Serverless: Deploy Open Source LLMs in Minutes | Runpod Blog How to Run a GPU-Accelerated Virtual Desktop on Runpod | Runpod Blog Introduction to vLLM and PagedAttention | Runpod Blog Run DeepSeek R1 on Just 480GB of VRAM | Runpod Blog A note to the developers who built Runpod with us | Runpod Blog New update to Github integration: release rollback! | Runpod Blog DeepSeek V3.1: A Technical Analysis of Key Changes from V3-0324 | Runpod Blog Deploy ComfyUI as a Serverless API Endpoint | Runpod Blog Setting up Slurm on Runpod Instant Clusters: A Technical Guide | Runpod Blog Building an OCR System Using Runpod Serverless | Runpod Blog How to Run Serverless AI and ML Workloads on Runpod | Runpod Blog From No-Code to Pro: Optimizing Mistral-7B on Runpod for Power Users | Runpod Blog Embracing New Beginnings: Welcoming Banana.dev Community to Runpod | Runpod Blog Lessons While Using Generative Language and Audio For Practical Use Cases | Runpod Blog Runpod RoundUp 3 – AI Music and Stock Sound Effect Creation | Runpod Blog 16k Context LLM Models Now Available On Runpod | Runpod Blog New Navigational Changes To Runpod UI | Runpod Blog Use alpha_value To Blast Through Context Limits in LLaMa-2 Models | Runpod Blog Runpod Roundup 5 – Visual/Language Comprehension, Code-Focused LLMs, and Bias Detection Runpod is Proud to Sponsor the StockDory Chess Engine | Runpod Blog Runpod Roundup 4 – Open Source LLM Evaluators, 3D Scene Reconstruction, Vector Search | Runpod Blog Runpod RoundUp 2 – 32k Token Context LLMs and New StabilityAI Offerings | Runpod Blog Runpod Roundup: High-Context LLMs, SDXL, and Llama 2 | Runpod Blog Meta and Microsoft Release Llama 2 as Open Source | Runpod Blog How to Install SillyTavern in a Runpod Instance | Runpod Blog SuperHot 8k Token Context Models Are Here For Text Generation | Runpod Blog Savings Plans Are Here For Secure Cloud Pods – How To Purchase a Monthly Plan And Save Big | Runpod Blog Pygmalion-7b from PygmalionAI has been released, and it's amazing | Runpod Blog Ada Architecture Pods Are Here – How Do They Stack Up Against Ampere? | Runpod Blog Using OpenPose to Annotate Poses Within Stable Diffusion | Runpod Blog How to Install SillyTavern in a RunPod Instance | Runpod Blog How to Manage Funding Your RunPod Account | Runpod Blog
When to Choose SGLang Over vLLM: Multi-Turn Conversations and KV Cache Reuse
2026-05-12 · via Runpod Blog.

When deploying large language models on Runpod, choosing the right inference framework can dramatically impact both performance and cost efficiency. While vLLM has dominated the high-throughput inference space, SGLang emerges as the clear winner for a specific but increasingly important use case: multi-turn conversations with shared context.

Understanding Multi-Turn Conversations and Caching

Most production AI applications handle complex, multi-turn interactions where context builds over time, such as customer support chatbots, coding assistants, or educational tutoring systems. Both vLLM and SGLang recognize that reprocessing identical context repeatedly is wasteful, but they solve this problem differently.

vLLM's Automatic Prefix Caching vs SGLang's RadixAttention

vLLM's Automatic Prefix Caching (APC):

  • Caches exact prefix matches using block-level storage
  • Requires identical token sequences to trigger cache hits
  • Optimized for batch inference where multiple requests share exact prefixes
  • Works best with templated prompts and structured batch processing
  • Manual configuration often needed for optimal cache utilization

SGLang's RadixAttention:

  • Uses a radix tree structure for more flexible prefix matching
  • Automatically detects and caches partial overlaps in conversation context
  • Designed for dynamic multi-turn conversations with evolving context
  • Zero configuration - automatically optimizes cache usage patterns
  • Better handles branching conversations and varied interaction patterns

The Key Difference: Static vs Dynamic Optimization

The fundamental difference lies in their design philosophy:

vLLM excels when you can predict and structure your caching patterns. If you're running batch inference on templated prompts or have consistent request patterns, vLLM's APC provides excellent performance with precise control.

SGLang shines in unpredictable, dynamic scenarios where conversation flows vary. Its radix tree approach automatically discovers caching opportunities that would require manual optimization in vLLM.

Prompting example

Here's an example of what setting the stage for a multi-turn prompt might look like:


And here's the multi-turn prompting we will be doing for our tests:


You can see that during this test, it successively builds on each prior example. Notice how each user question is completely different:

  • "what would you recommend for training a large language model on a budget?"
  • "how would you optimize inference costs for a production deployment?"
  • "What are the key considerations for distributed training mentioned in the context?"

RadixAttention's tree structure elegantly handles this pattern. It caches the shared system context once, then efficiently processes each unique user query. The cache hit covers the expensive part (processing thousands of tokens of technical context), while only the small user queries need fresh computation.

Testing parameters

To get some benchmarking numbers, I used 2X H100 SXM pods in Secure Cloud, using deepseek-ai/DeepSeek-R1-Distill-Llama-70B. Running the prompt, I get the following results.

Large Context RadixAttention Analysis

Test Type Duration Prompt Tokens Response Tokens Speed (tok/s)
7k context, fresh 5.093s 6,835 150 29.5
7k context, cache 4.287s 6,829 150 35.0
7k context, cache 2 4.295s 6,824 150 34.9
Small context 4.154s 72 150 36.1

So, hitting the cache on a 7k size prompt results in about a ~20% increase, and roughly matching the speed on a small prompt with no context at all.

vLLM Benchmark Results

Test Type Duration Prompt Tokens Response Tokens Speed (tok/s)
7k context, fresh 5.253s 6,801 150 28.6
7k context, cache 4.572s 6,801 150 32.8
7k context, cache 2 4.510s 6,801 150 33.3
Small context 4.124s 241 149 36.1

The results paint a picture - on fresh context, the two engines are relatively equally matched. However, RadixAttention gives a clear benefit in the larger multi-turn conversations, especially when the cache is involved, giving about a 10% boost over vLLM at the same context loads. These benchmark results translate to significant cost savings in production scenarios. Consider a customer support chatbot handling 1,000 conversations per hour, where each conversation averages 5 turns with substantial context. The 10-20% performance improvement from SGLang's RadixAttention means significant compute save, especially in a serverless environment where compute is paid for by the second.

When to Choose Each Framework

Choose SGLang when:

  • Building conversational AI applications with unpredictable dialog flows
  • Handling customer support, tutoring, or coding assistance use cases
  • Working with large context windows that vary between conversations
  • Prioritizing zero-configuration optimization
  • Deploying applications where conversation context frequently overlaps but isn't identical

Choose vLLM when:

  • Running batch inference with predictable, templated prompts
  • Handling high-throughput scenarios with exact prefix matches
  • Need fine-grained control over caching behavior
  • Working with structured workflows where request patterns are consistent
  • Prioritizing maximum throughput over conversation-specific optimizations

See for yourself

I've written a Jupyter notebook and some handy scripts that let you run the same prompt against both engines and it will tell you which is better for your GPU spec and your prompt, using the same real world hardware that you'll be using in production. Download it from GitHub here.

The script provides detailed metrics for each engine. Here's an example for a simple one-shot prompt (Do an in-depth historical analysis of the Declaration of Independence)

Engine Duration Tokens Generated Speed (tok/s)
vLLM 🥇 17.06s ~1024 60.0
SGLang 15.49s ~817 52.7

Note: vLLM was 1.1× faster in tokens per second.


So in this one-shot case, vLLM actually came up on top - so clearly, it's not as simple as just picking one package or another every single time.

Conclusion

We offer both SGLang and vLLM in our Quick Deploy endpoints - but to date, we haven't been super clear about why you would want to use one over the other. We want to empower you to choose the right one for the job. Both engines are very powerful and flexible, but the truth is there's going to be one tool that's better for any particular job, and we want to empower you to choose the right one for the job. Now, we have a notebook that will help you do that.