Reliable LLM Inference at Scale
2026-05-28 · via Databricks

At Databricks, we’ve built a unique inference platform that serves every frontier model, from open source models like Kimi and Qwen to proprietary models like OpenAI, Gemini, and Claude. We power inference for some of the largest agentic applications in the world, including Superhuman, Yipit Data, Fox Sports, and others. Today, we serve more than 120T tokens per month.

What makes LLM serving hard at scale is reliability. With agents becoming the interface to how we work and live, inference demand is growing exponentially. We see extremely spiky demand curves that peak during working hours.

Figure 1: 2 days of traffic for one of our largest customers on LLM Serving. Within hours, we see dramatic spikes of traffic.

Challenges of running LLM Inference at scale

What does it mean to be a reliable inference platform? The contract appears simple: availability is whether the request can be processed. In practice, however, different use cases have significantly different latency requirements, and this factors into availability. The most advanced agents cannot afford for p95 time to first token (TTFT) and output tokens per second (OTPS) to degrade.

In a multi-tenant system for LLM serving, achieving both reliability and latency is challenging.

Reliability

Frontier performance requires the latest GPUs with high bandwidth interconnect for KV cache transfer. These compute setups are fundamentally less reliable than classical CPU systems, and they are expensive. Given that all-to-all communication is required, a single node’s downtime requires reconfiguration for multiple other nodes in disaggregated prefill/decode setups. The highest bandwidth networking requires single-spine connectivity in a single physical rack (e.g. NVL72 systems). This means failures in specific systems within a single datacenter rack can create a wide-blast-radius outage. Standard tricks in distributed systems like multi-AZ or leveraging backup instance types mean keeping expensive backup GPUs idling, a cost-prohibitive option. Overprovisioning is another classic trick, but given compute supply is so constrained, it’s extremely expensive and impractical. Thus, systems must remain operational under heavy strain.

Shipping velocity also needs to remain high under these constraints - our inference demand has grown multiple orders of magnitude year-over-year, and fueling that growth while shipping innovative features has been challenging. Features like images, videos, and safety classification each require different preprocessing systems, all of which must scale independently.

Finally, achieving best-in-class performance and supporting new model architectures requires optimizations that span the gamut from custom kernels to proprietary inference engines. As architectures subtly change, new low-level software often gets introduced that can fail in opaque ways at scale, surfacing in difficult debugging scenarios ranging from server hangs to GPU crashes. 

Latency

Keeping latency under control with diverse load patterns is challenging. This is because the cost to serve a request is highly variable and hard to estimate a priori. Even healthy servers under heavier load process all requests more slowly, exposing a tradeoff between throughput (and thus cost efficiency) and the fastest latency that products need to handle. This can also manifest as a reliability problem, since servers can unexpectedly enter unhealthy states very quickly based on the mix of requests assigned to them.

Figure 2: Realistic concurrency vs. latency benchmarking based on a large customer’s customer support agent workload.

Additionally, latency is dominated by output token generation, but up-front estimation of cost is hard, since it’s difficult to predict how long the model will talk for. Thus, low latency serving requires complex capacity management, load balancing, and request prioritization systems. 

Overall architecture

Before we dive into the specifics of how we address those problems, let’s walk through a high-level overview of our serving infrastructure.

In the data plane,

  • The inference runtime (open source and proprietary in-house engines) is deployed on frontier GPUs
  • To handle traffic across model deployments, the data plane runs a router, which we call Axon, that balances load among replicas of the same model, and an autoscaler that adjusts replica counts.

In the control plane,

  • Requests go through rate limiting before reaching the data plane.
  • Based on request metrics, the capacity management algorithm determines how much GPU capacity each workload gets, which the autoscaler then enforces.

Figure: Control plane and data plane.

Getting a handle on capacity

We need to be able to roughly reason about capacity - how much we have, how much we’ve sold, and how much customers are using. To do this, we introduced an abstraction called "model units." If we project that a replica can process a fixed number of model units per minute (e.g., 100), we can make the following assumptions:

  • Requests with long input or output consume more model units, since fewer can complete in the same time window.
  • Prefill and decode have different throughput characteristics, so requests with long output cost more than those with long input.
Figure 3: Cost of a request varies non-linearly and in complex multidimensional ways, depending on the input and output token distribution. This is in sharp contrast to classical AI systems where latency per request is roughly uniformly distributed.

Therefore, we model request cost using a multi-dimensional function such as:

The coefficients α, β, γ are determined by automated benchmarking for each model on each hardware type. Model units can be further adjusted for optimizations like prefix caching, and they must account for features like multi-modality. 
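The formula itself is not reproduced in this copy of the post, so the following is only a sketch of one plausible shape, not the function Databricks uses. It assumes α weights input tokens, β weights output tokens, and γ weights an input-output interaction term; the coefficient values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostCoefficients:
    """Hypothetical per-model, per-hardware coefficients (illustrative values only)."""
    alpha: float  # assumed weight per input (prefill) token
    beta: float   # assumed weight per output (decode) token
    gamma: float  # assumed weight for the input x output interaction


def model_units(input_tokens: int, output_tokens: int, c: CostCoefficients) -> float:
    """Estimate a request's cost in model units under the assumed functional form."""
    return (
        c.alpha * input_tokens
        + c.beta * output_tokens
        + c.gamma * input_tokens * output_tokens
    )


# Made-up coefficients where decode is 10x more expensive per token than prefill.
coeffs = CostCoefficients(alpha=0.001, beta=0.01, gamma=0.000001)

short_chat = model_units(500, 200, coeffs)
long_input = model_units(5_000, 500, coeffs)
long_output = model_units(500, 5_000, coeffs)
```

With these assumed coefficients, the long-output request costs several times more than the long-input request of the same total size, which matches the second assumption in the list above.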

Such estimations are structurally imperfect, but they serve as a way for us to break a multi-tenant system into something more manageable that resembles cloud VMs. VMs have the desirable property of offering predictable performance that can be allocated to specific customers. For production agentic workloads, it’s important to offer guarantees around low latency and capacity, and without such allocation systems, the best we can do is offer “best-effort” capacity that could be clawed back if too many customers use the system.

Cost-based load balancing and autoscaling

Since requests have a highly variable impact on servers, it’s important to make nearly optimal routing decisions. In general, load balancing tends to lean on statistical approaches like P2C (power of two choices), which estimate load based on queue size and leverage sampling to reduce the memory and latency overheads of understanding all the possible targets. However, LLM latencies tend to be high, server counts are lower than in scaled-out CPU systems, and the cost of misrouting is severe. Therefore, LLM serving necessitates a different approach.
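For reference, classical P2C can be sketched in a few lines. This is a generic textbook version with a hypothetical function name, not code from our router. Note that its only load signal is queue length, so one queued long-context request counts the same as one short request.

```python
import random
from typing import Sequence


def pick_p2c(queue_lengths: Sequence[int], rng: random.Random) -> int:
    """Power of two choices: sample two distinct servers, return the less-queued one."""
    a, b = rng.sample(range(len(queue_lengths)), 2)
    return a if queue_lengths[a] <= queue_lengths[b] else b


rng = random.Random(0)
# With only two servers, both are always sampled, so the shorter queue wins.
chosen = pick_p2c([4, 1], rng)
```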

Today, we use Dicer, Databricks' auto-sharder, to dynamically route workloads across servers. Without load-aware routing, long-context requests cause individual servers to become hotspots while others sit underutilized. We integrated model units with Dicer so that routing decisions are based on server load in model units rather than traditional request-based heuristics. Dicer also provides stateful sessions, making request routing sticky. A workload's requests go to only a subset of servers, which improves cache hit rates (crucial for latency-sensitive workloads like coding agents) and limits blast radius.
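The idea of sticky, load-aware routing can be illustrated without Dicer. The sketch below does not use Dicer's API; it swaps in rendezvous hashing to pick a stable server subset per workload, then sends each request to the least-loaded server in that subset, measured in model units. All names and numbers are hypothetical.

```python
import hashlib
from typing import Dict, List


def sticky_subset(workload_id: str, servers: List[str], subset_size: int) -> List[str]:
    """Deterministically map a workload to a fixed subset of servers."""
    def score(server: str) -> int:
        digest = hashlib.sha256(f"{workload_id}:{server}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(servers, key=score, reverse=True)[:subset_size]


def route(
    workload_id: str,
    request_cost: float,
    servers: List[str],
    load: Dict[str, float],
    subset_size: int = 2,
) -> str:
    """Pick the server in the workload's subset with the least load in model units."""
    candidates = sticky_subset(workload_id, servers, subset_size)
    target = min(candidates, key=lambda s: load[s])
    load[target] += request_cost  # account for the new request's estimated cost
    return target


servers = ["gpu-0", "gpu-1", "gpu-2", "gpu-3"]
load = {s: 0.0 for s in servers}
first = route("coding-agent", 40.0, servers, load)  # one expensive long-context request
second = route("coding-agent", 1.0, servers, load)  # lands on the other subset member
third = route("coding-agent", 1.0, servers, load)   # 1.0 < 40.0, so it avoids the hot server
```

A request-count heuristic would have seen one request on each server after the second call and could have sent the third request to the hot server.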

As we learn more, we can tune the load metrics and adopt better routing systems based on higher-fidelity cost metrics.

Figure 4: The router and autoscaler both consume server load, so a small number of expensive long-context requests can trigger different routing and scaling decisions than many cheap short requests.

A similar problem exists in autoscaling. Pending request counts alone don't reflect true load. A spike in long-context requests looks identical to a spike in short ones, and CPU and memory metrics are similarly uncorrelated with actual GPU utilization.

Using model units, our autoscaler can decide whether to scale up or down based on the model unit utilization ratio. When the inference engine is running close to some percent of its maximum model units (determined by hardware type and workload shape), it's approaching peak throughput, which triggers scale-up. The reverse triggers scale-down. Rather than manually adjusting autoscaling rules for each model, this approach allows for model-agnostic scaling infrastructure.
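A minimal sketch of such a rule, assuming a proportional policy that sizes the fleet so each replica runs at a target fraction of its maximum model units. The function name, target value, and replica bounds are hypothetical, not our production settings.

```python
import math


def desired_replicas(
    used_model_units: float,
    max_model_units_per_replica: float,
    target_utilization: float,
    min_replicas: int = 1,
    max_replicas: int = 64,
) -> int:
    """Size the fleet so each replica runs near the target utilization ratio."""
    usable_per_replica = max_model_units_per_replica * target_utilization
    needed = math.ceil(used_model_units / usable_per_replica)
    # Clamp to the allowed range so the fleet never scales to zero or without bound.
    return max(min_replicas, min(max_replicas, needed))


# Each replica handles 100 model units per minute; we aim for 50% utilization.
peak = desired_replicas(450.0, 100.0, target_utilization=0.5)
quiet = desired_replicas(90.0, 100.0, target_utilization=0.5)
```

Because the input is model units rather than request counts, the same function applies to any model once its per-replica maximum has been benchmarked.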

Building autoscaling on top of LLM inference patterns saved us from always scaling to max replicas. For models with bursty traffic, autoscaling kept replica counts close to actual demand, translating to over 80% GPU savings compared to static provisioning at peak.

Runtime Reliability

Smart routing and scaling provided a great foundation, but they don't prevent failures at the engine level. No matter which inference engine we deploy (our in-house engine or popular open-source options), edge cases and resource contention emerge at production scale. We need mechanisms to detect and recover from failures automatically.

Detecting and recovering from silent failures

One failure mode we encounter is silent hangs. Requests involving edge cases (structured output, multimodal inputs) can trigger unhandled errors in the multi-process architecture of inference engines, causing servers to stop responding without surfacing errors.

We detect this with periodic black-box health checks: minimal end-to-end requests sent when no real requests have completed recently. If a health check fails, the Kubernetes liveness probe restarts the server. This works across all engines regardless of internal implementation.
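The gating logic can be sketched as follows. This is an illustration with hypothetical names: `send_probe` stands in for whatever minimal end-to-end request the server supports, and `is_alive` is what a liveness endpoint would call.

```python
import time
from typing import Callable


class BlackBoxHealthChecker:
    """Probe the engine end to end only when real traffic has gone quiet."""

    def __init__(
        self,
        send_probe: Callable[[], bool],
        idle_threshold_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send_probe = send_probe
        self._idle_threshold_s = idle_threshold_s
        self._clock = clock
        self._last_success = clock()

    def record_request_completed(self) -> None:
        """Call whenever a real request finishes successfully."""
        self._last_success = self._clock()

    def is_alive(self) -> bool:
        """Return True if recent traffic or a fresh probe shows the engine responding."""
        if self._clock() - self._last_success < self._idle_threshold_s:
            return True  # a real request completed recently, no probe needed
        ok = self._send_probe()
        if ok:
            self._last_success = self._clock()
        return ok
```

Skipping the probe while real requests are completing keeps the check cheap on busy servers, and the probe treats the engine as a black box, so it does not depend on engine internals.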

However, under high load, health checks themselves can time out, causing the liveness probe to kill servers that are actually healthy. This risks cascading failures. To solve this, we assign health check requests the highest scheduling priority, ensuring they complete even under heavy load. With prioritized health checks, the full cycle of detecting a hang, killing the unhealthy server, and recovering takes less than 5 minutes. False liveness probe failures dropped from several per week to zero.
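The prioritization idea can be shown with a small priority queue. This is a sketch, not the scheduler of any particular inference engine; the priority constants and class name are assumptions.

```python
import heapq
import itertools
from typing import Any, List, Tuple

HEALTH_CHECK_PRIORITY = 0  # assumed convention: lower number is scheduled first
DEFAULT_PRIORITY = 10


class PriorityScheduler:
    """Serve requests by priority, FIFO within the same priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._counter = itertools.count()  # tie-breaker that preserves arrival order

    def submit(self, request: Any, priority: int = DEFAULT_PRIORITY) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_request(self) -> Any:
        _, _, request = heapq.heappop(self._heap)
        return request


scheduler = PriorityScheduler()
scheduler.submit("user-1")
scheduler.submit("user-2")
scheduler.submit("health-check", priority=HEALTH_CHECK_PRIORITY)

order = [scheduler.next_request() for _ in range(3)]
```

The health check arrives last but is scheduled first, so a deep backlog of user requests cannot push it past the liveness probe's timeout.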

Handling unexpected load from multimodal requests

When large batches of multimodal requests arrived, we saw spikes in error rates and timeouts from a completely different source.

Investigations revealed that requests weren't even reaching the inference engine's core processes. Serving image requests is more resource-expensive than text-only requests, not just from the additional vision encoder running on GPUs, but also from CPU-intensive image processing. For certain models, the image processing was extremely slow, blocking the event loop entirely.

Moving blocking operations into separate threads and processes didn't solve the problem; requests still piled up under high image load. So we profiled the Python processes and made several discoveries:

  • Among all CPU operations for images, image processing (resizing and normalization) is 10x slower than other operations like base64 decoding.
  • Some Hugging Face models default to the PIL-based image processor, while others use the faster Torchvision-based processor.
  • In containerized environments, OMP_NUM_THREADS (which controls the number of OpenMP threads used by Torch for CPU operations) defaults to the number of vCPUs on the host machine. In multitenant setups, this is a poor default: a host might have 192 vCPUs, but a container only has access to 12. The result is far more running threads than available cores. This drives CPU usage past the container's limit and triggers throttling.
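One way to derive a sensible thread count is to read the container's own CPU limit instead of the host's vCPU count. The sketch below assumes a cgroup v2 environment where the limit is exposed at `/sys/fs/cgroup/cpu.max`; the path and helper names are assumptions and may differ in your setup.

```python
import os
from typing import Optional


def cpu_limit_from_cgroup_v2(cpu_max_contents: str) -> Optional[int]:
    """Parse cgroup v2 `cpu.max`, formatted as "<quota> <period>" or "max <period>"."""
    quota, period = cpu_max_contents.split()
    if quota == "max":
        return None  # no CPU limit is set on this container
    # Round down so the thread count never exceeds the quota, but keep at least one.
    return max(1, int(quota) // int(period))


def container_cpu_count(cpu_max_path: str = "/sys/fs/cgroup/cpu.max") -> int:
    """Return the container's CPU limit, falling back to the visible CPU count."""
    try:
        with open(cpu_max_path) as f:
            limit = cpu_limit_from_cgroup_v2(f.read())
    except (OSError, ValueError):
        limit = None
    return limit if limit is not None else (os.cpu_count() or 1)


# Set this before importing torch or any other OpenMP-backed library.
# setdefault keeps any value an operator has already configured.
os.environ.setdefault("OMP_NUM_THREADS", str(container_cpu_count()))
```

For the example in the list above, a container limited to 12 CPUs would report `1200000 100000` in `cpu.max`, which this parses to 12 threads rather than the host's 192.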

By switching to Torchvision-based image processors and properly configuring OMP_NUM_THREADS, we sustained much higher QPS and fully leveraged the GPUs. After the fix shipped, requests completed per second more than tripled with the same replicas and load. CPU throttling disappeared, and servers ran in a much healthier state.

Figure 5: RPS per server after we optimized the image processing bottlenecks

Conclusion

Serving LLMs reliably at scale requires work across every layer of the inference stack. We've covered autoscaling and load balancing infrastructure designed around LLM workloads, and runtime mechanisms that stay stable regardless of engine or workload. There's a lot more to the story: fast container start, safe rollouts across GPU fleets, GPU capacity management across clouds and regions. If these are the kinds of problems you want to work on, we’re hiring!