Supercharge Your LLMs with SGLang: Boost Performance and Customization
Brendan McKeag · 2024-08-15 · via Runpod Blog.

Runpod is proud to partner with LMSYS once again to put a spotlight on its inference engine, SGLang. LMSYS has a storied history in the language model space, with prior contributions such as Chatbot Arena, which compares outputs from competing models; Vicuna, an open-source competitor to ChatGPT; and large datasets such as LMSYS-Chat-1M for studying model behavior. SGLang is built to be easy to set up, requiring little more than a few lines of code, and to scale to workloads of all sizes.

Rethinking Inference Efficiency

In many earlier large language model use cases, speed took a backseat to inference quality. With the proliferation of LLM-as-a-service use cases, though, it's clear that token throughput is becoming just as important. Users now interact with models in a far more granular manner, where a slow response is much more evident and off-putting. Imagine how frustrated a user must feel when they reach out to a company for technical support and the chatbot they're trying to navigate is overloaded and unresponsive. It's time to think about not only how much VRAM you're getting for your GPU spend, but also how many tokens per second you can generate. Sure, you could simply move to a higher GPU spec, but what if you could find a more efficient implementation on the hardware budget you already have?

The SGLang Team

SGLang was originally developed at LMSYS.org, led by Ying Sheng and Lianmin Zheng. They are the co-founders of LMSYS and have collaborated on many other projects there, including FastChat, Chatbot Arena, Vicuna, LMSYS-Chat-1M, and S-LoRA. They have also worked on other projects, such as FlexGen.

Lianmin states, "We've been using Runpod to develop SGLang and run benchmarks. With a variety of affordable GPU models available, it's easy to test our code. Runpod has significantly sped up our development process." Runpod offers dozens of different GPU specs in various generations, allowing developers access to all levels of hardware to ensure compatibility and usability over a wide variety of real-life use cases.

SGLang has been built by individuals with diverse backgrounds from all over the world. Its active core developers include passionate individual contributors; undergraduates from Shanghai Jiao Tong University; PhD students from Stanford, UC Berkeley, CMU, UCLA, and NTU; and engineers from tech companies like Databricks, X.ai, and ByteDance/TikTok.

How SGLang Works

The SGLang team has published a paper on arXiv explaining their methodology, with the abstract reproduced below:

Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications. We introduce SGLang, a system for efficient execution of complex language model programs. SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control. The runtime accelerates execution with novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4x higher throughput compared to state-of-the-art inference systems on various large language and multi-modal models on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat.

A quick review of the literature reveals the following highlights on just how SGLang achieves these benefits:

  1. Inference engines often "reinvent the wheel" on each generation, recomputing the entire context even though the vast majority of it is unchanged from the previous call apart from one addition. This is especially true in long conversations. RadixAttention allows reuse of the KV cache, so each run does not have to start from scratch.
  2. Better implementation of parallelism allows a more complete use of the GPU hardware, leading to fewer wasted cycles.
  3. The compressed finite state machine allows SGLang to take bigger "steps" when guiding the language model's output, especially through parts of the desired format that are fixed or highly predictable.
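To make the first point concrete, here is a toy sketch of prefix reuse. It is not SGLang's RadixAttention implementation: it uses a plain token trie rather than a compressed radix tree, stores no actual key/value tensors, and the class and method names (ToyPrefixCache, prefill) are hypothetical. It only counts how many tokens would need fresh computation when a second request shares a prefix with the first.

```python
from typing import Dict, Sequence


class _Node:
    """One trie node; each outgoing edge is labelled with a single token id."""

    def __init__(self) -> None:
        self.children: Dict[int, "_Node"] = {}


class ToyPrefixCache:
    """Toy stand-in for a KV cache indexed by token prefix (illustrative only)."""

    def __init__(self) -> None:
        self._root = _Node()

    def match_prefix(self, tokens: Sequence[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self._root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: Sequence[int]) -> None:
        """Record that cache entries for this token sequence now exist."""
        node = self._root
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())

    def prefill(self, tokens: Sequence[int]) -> int:
        """Simulate one request; return the number of tokens needing computation."""
        reused = self.match_prefix(tokens)
        self.insert(tokens)
        return len(tokens) - reused


cache = ToyPrefixCache()
turn_1 = [1, 2, 3, 4, 5]        # e.g. system prompt + first user message
turn_2 = turn_1 + [6, 7, 8]     # same history plus one new message
computed = [cache.prefill(turn_1), cache.prefill(turn_2)]
print(computed)  # [5, 3]
```

The second turn only pays for its three new tokens, which is the effect the list item describes for long conversations.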

Essentially, SGLang finds ways to make better use of the hardware it is given than competing engines do, leading to large efficiency gains, all other things being equal.
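The third point in the list above, taking bigger "steps" through fixed parts of a format, can be illustrated with a small step-counting sketch. This is a simplified model under stated assumptions, not SGLang's compressed finite state machine: the template, its token counts, and the function name decoding_steps are all made up for illustration, and a run of format-determined tokens is assumed to cost a single step when it can be appended at once.

```python
from typing import List, Tuple

# (kind, token_count): "fixed" tokens are fully determined by the output format,
# "free" tokens must actually be chosen by the model. Counts are illustrative.
TEMPLATE: List[Tuple[str, int]] = [
    ("fixed", 4),   # e.g. {"name": "
    ("free", 3),    # the name itself
    ("fixed", 5),   # e.g. ", "age":
    ("free", 1),    # the age
    ("fixed", 1),   # e.g. }
]


def decoding_steps(template: List[Tuple[str, int]], jump_forward: bool) -> int:
    """Count decoding steps needed to emit the whole template."""
    steps = 0
    for kind, n_tokens in template:
        if kind == "fixed" and jump_forward:
            steps += 1          # the whole fixed run is appended in one step
        else:
            steps += n_tokens   # one step per token
    return steps


baseline = decoding_steps(TEMPLATE, jump_forward=False)
compressed = decoding_steps(TEMPLATE, jump_forward=True)
print(baseline, compressed)  # 14 7
```

The more of the output the format dictates, the larger the gap between the two counts, which is why structured output such as JSON benefits most.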

Benefits of SGLang

These efficiency gains are truly where SGLang shines. The team's benchmarks speak for themselves. On an input dataset with lengths uniformly distributed between 1 and 256 tokens, the engine achieves:

  • 5,000 tokens per second (t/s) serving Llama3-8B bf16 on a single A100
  • Up to 4,000 t/s serving Llama3-70B bf16 on an 8xA100 cluster
  • Up to 10,000 t/s serving Llama3-70B fp8 on an 8xH100 cluster
  • Up to 2,500 t/s serving Llama3-405B fp8 on an 8xH100 cluster
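As a rough way to read these figures, the sketch below converts each quoted throughput into the time needed to generate one million tokens, and then into a cost. The throughput numbers come from the list above and are peak ("up to") values, so the times are best cases. The hourly rate is a hypothetical placeholder, not a Runpod price.

```python
# Throughput figures quoted in the list above, in tokens per second.
THROUGHPUT_TPS = {
    "Llama3-8B bf16, 1xA100": 5_000,
    "Llama3-70B bf16, 8xA100": 4_000,
    "Llama3-70B fp8, 8xH100": 10_000,
    "Llama3-405B fp8, 8xH100": 2_500,
}

# Hypothetical hourly price for the whole machine, in USD. Placeholder only.
HYPOTHETICAL_RATE_USD_PER_HOUR = 2.00


def seconds_per_million_tokens(tokens_per_second: float) -> float:
    """Best-case wall-clock time to generate one million tokens."""
    return 1_000_000 / tokens_per_second


def cost_per_million_tokens(tokens_per_second: float, hourly_rate_usd: float) -> float:
    """Best-case cost of one million tokens at a given hourly rate."""
    return seconds_per_million_tokens(tokens_per_second) / 3600 * hourly_rate_usd


for name, tps in THROUGHPUT_TPS.items():
    seconds = seconds_per_million_tokens(tps)
    cost = cost_per_million_tokens(tps, HYPOTHETICAL_RATE_USD_PER_HOUR)
    print(f"{name}: {seconds:.0f} s per 1M tokens, ${cost:.3f} at the placeholder rate")
```

Because cost scales inversely with throughput, doubling tokens per second on the same hardware halves the cost of the same workload.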

On top of that, SGLang is completely open source under the Apache 2.0 license. Its speed, flexibility, and licensing make it suitable for enterprise-level applications serving a wide field of users, especially those requiring a fast response time. SGLang is the fastest among popular open-source inference solutions on many benchmarks and is especially suitable for batch processing and synthetic data generation.

These gains position SGLang as one of the engines of choice where response time is critical. Applications such as virtual assistants, real-time language translation, and even gaming have an increasing need for large language models. The game Infinite Craft is a great example: although it looks like a simple, if addictive, browser-based game, it is in truth driven entirely by numerous short-form LLM prompts, as described by creator Neal Agarwal: "I'm using the latest Llama 2 LLM from Facebook on the backend ... Every time someone tries to craft something novel, I ask Llama 2 with a prompt what the result should be." This illustrates the kind of token throughput demand that the general public is going to create with locally-hosted LLMs, and these are the cases in which the SGLang inference engine is going to excel.

Already, major organizations such as Databricks, ByteDance, UC Berkeley, and UCLA are using SGLang to serve LLMs to the public, for research, or for data pipelines. Some models served at LMSYS Chatbot Arena are also powered by SGLang.

How to Get Started with SGLang on Runpod

You can install and run SGLang in any PyTorch-equipped pod on the platform. Just go to the My Pods page and spin up a pod of any type. Be sure to expose a port for the server to run on, which can be done on the Edit Pod screen (the server's default port is 30000). Installation instructions for the package itself can be found on SGLang's GitHub, but the easiest way is probably going to be through pip.
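A minimal sketch of the pip route, driven from Python so it can also be run from a notebook cell. It assumes the "sglang[all]" extra that the project's README has documented; check the SGLang GitHub for the current install line, since extra dependencies (such as CUDA kernels) may be listed there. The helper names are hypothetical.

```python
import importlib.util
import subprocess
import sys
from typing import List


def pip_install_command(requirement: str = "sglang[all]") -> List[str]:
    """Build a pip command that targets the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", "--upgrade", requirement]


def ensure_installed(module: str = "sglang") -> bool:
    """Install the package only if it cannot be imported yet.

    Returns True if an install was run, False if the module was already present.
    """
    if importlib.util.find_spec(module) is not None:
        return False
    subprocess.run(pip_install_command(), check=True)
    return True


# Show the command without running it; call ensure_installed() on the pod to install.
print(" ".join(pip_install_command()))
```

Using sys.executable rather than a bare pip keeps the install in the same environment as the interpreter that will later import the package.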

After you install the package, you can start a server from a terminal, passing the Hugging Face path of the model you want to serve.

If you do not already have the model installed (Meta-Llama-3-8B in this walkthrough), the server will helpfully download it for you from Hugging Face. You can serve whatever model you would like by swapping in its Hugging Face path.
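A sketch of the launch step, again built from Python. It assumes the sglang.launch_server module with its --model-path, --host, and --port flags as documented in the SGLang README. The model path shown is the Hugging Face id matching the model named above, and binding to 0.0.0.0 is an assumption made here so the server is reachable from outside the pod.

```python
import shlex
import sys
from typing import List


def launch_command(
    model_path: str = "meta-llama/Meta-Llama-3-8B",
    port: int = 30000,
    host: str = "0.0.0.0",
) -> List[str]:
    """Build the server launch command for a given model and port."""
    return [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", model_path,  # swap in any Hugging Face model path
        "--host", host,              # assumption: listen on all interfaces
        "--port", str(port),         # must match the port exposed on the pod
    ]


# Print the command so it can be pasted into a terminal on the pod.
print(shlex.join(launch_command()))
```

To start the server from Python instead of a terminal, the same list can be passed to subprocess.Popen; the process keeps running until it is stopped.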

You can then send requests with cURL. Use localhost if you are running on the pod itself. If your terminal session is somewhere other than the pod, send requests through the proxy instead: just swap out localhost for the proxy address (e.g. https://8uxszoc3paq2fh-8888.proxy.runpod.net/). The number after the hyphen in the proxy address is the exposed port, so it should match the port the server is listening on. The SGLang team has also provided documentation for sampling parameters on their GitHub.
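The same request can be sketched in Python with only the standard library. This assumes SGLang's native /generate route, with a JSON body containing a "text" prompt and a "sampling_params" object, as shown in the project's README; the helper names and the default values are illustrative. Pass the proxy address as base_url when calling from outside the pod.

```python
import json
import urllib.request
from typing import Any, Dict


def generate_url(base_url: str = "http://localhost:30000") -> str:
    """Return the generate endpoint for a local or proxied server."""
    return base_url.rstrip("/") + "/generate"


def build_payload(prompt: str, max_new_tokens: int = 16,
                  temperature: float = 0.0) -> Dict[str, Any]:
    """Assemble the request body: a prompt plus sampling parameters."""
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


def generate(prompt: str, base_url: str = "http://localhost:30000",
             timeout: float = 60.0) -> Dict[str, Any]:
    """POST the prompt to a running server and return the decoded JSON reply."""
    request = urllib.request.Request(
        generate_url(base_url),
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the payload needs no server; generate(...) requires one to be running.
payload = build_payload("Once upon a time,")
print(json.dumps(payload))
```

Keeping the payload builder separate from the network call makes it easy to inspect or log exactly what is being sent before pointing it at a live server.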

You can also connect through Jupyter Notebook once the server is running and send prompts for inference that way.

Notebook and other examples can be found on the SGLang GitHub.
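For a notebook cell, one option is the OpenAI-compatible route. The sketch below is not taken from the SGLang repository. It assumes the server exposes /v1/completions in the OpenAI response format, and that "default" is accepted as the model name; both should be checked against the SGLang documentation for the version you install.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:30000"  # or the pod's proxy address


def completion_payload(prompt: str, model: str = "default",
                       max_tokens: int = 32) -> Dict[str, Any]:
    """Build an OpenAI-style completion request body."""
    return {
        "model": model,          # assumption: placeholder name for the served model
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def complete(prompt: str, base_url: str = BASE_URL, timeout: float = 60.0) -> str:
    """Send one prompt to a running server and return the generated text."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(completion_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# In a notebook, with the server running:
#   print(complete("The capital of France is"))
body_preview = completion_payload("The capital of France is")
print(json.dumps(body_preview, indent=2))
```

Because the route follows the OpenAI format, existing client code written against that format can usually be pointed at the pod by changing only the base URL.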


Conclusion

SGLang isn’t just another inference engine; it’s a game-changer for anyone working with large language models. Are you looking to deploy large language models with blazing fast inference speed and optimal efficiency? It's time to try out SGLang on Runpod! Using SGLang over other inference engines can lead to a marked decrease in your serverless billing, because requests that finish faster occupy a worker for less time. Your requests will finish that much faster and your users will be happier. We even have an experimental worker for SGLang that you can try out right now, with plans to roll it out to the wider platform shortly. How much do you think you might save in the end?

SGLang is designed to be fast, easy to use, and scalable with a focus on token throughput, a design point that is going to become more and more relevant as locally-hosted LLMs find their way into new applications. Optimizations allow for an unprecedented level of efficiency in fielding user requests. Major organizations are already incorporating it into their processes, and it couldn't be easier to get up and running on Runpod and start serving your users whether in a pod or serverless capacity.

If you'd like to discuss SGLang further, feel free to join the Runpod Discord as well as the LMSYS Discord, and review LMSYS' body of work on their site at lmsys.org.

Author profile: Brendan McKeag