Donating llm-d to the Cloud Native Computing Foundation
2026-03-24 · via IBM Research

Operationalizing AI inference is hard, especially with cutting-edge models and the infrastructure they require. New workloads are variable, and APIs don’t always make it possible to orchestrate inference. The cloud‑native world is racing to keep up with the demands of modern AI, and large language model (LLM) inference is one place where that pressure is felt most intensely.

As organizations push models into production, they’re discovering that serving LLMs at scale presents a new class of distributed systems challenges. That’s exactly the gap llm‑d was created to fill. llm-d addresses the limitations of traditional routing and autoscaling by offering a Kubernetes‑native distributed inference framework.

Today at KubeCon Europe, IBM Research, Red Hat, and Google Cloud announced the contribution of llm-d to the CNCF as a sandbox project. Launched as a collaborative effort with founding contributors NVIDIA and CoreWeave, and joined by industry leaders AMD, Cisco, Hugging Face, Intel, Lambda, and Mistral AI, alongside university supporters at the University of California, Berkeley, and the University of Chicago, the project has rapidly evolved into state-of-the-art AI infrastructure. This move marks a major milestone in IBM’s mission to make high‑performance, vendor‑neutral, Kubernetes‑native LLM inference accessible to everyone. By aligning with the CNCF, we’re doubling down on open governance, community‑driven development, and the belief that scalable generative AI should be a core feature of the cloud‑native ecosystem.

“llm-d bridges the gap between traditional distributed systems and the emerging AI inference stack, making large-scale model serving a first-class, cloud-native workload,” said Carlos Costa, a Distinguished Engineer at IBM Research who specializes in hybrid cloud platforms for AI. “This donation helps establish the CNCF as a home for AI inference infrastructure, catalyzing a broader ecosystem of composable systems and projects.”

Any model, any accelerator, any cloud

The mission of llm-d from the outset was to build a vendor-neutral inference serving stack that can be used with any combination of hardware and software. llm‑d, which was launched in 2025, is a Kubernetes‑native, high‑performance distributed inference framework designed to make serving LLMs at scale both predictable and efficient. To support modern generative AI workloads, it provides a modular architecture that turns inference engines like vLLM into production-ready, distributed, cloud-native inference systems capable of sustaining low latency and high throughput under real-world traffic.

“In the most fundamental sense, we’re taking inference from just standing up something and playing with models, to running them in production at scale with multiple users and models,” said Priya Nagpurkar, vice president of AI platform at IBM Research. “You need the scale, distribution, and reliability of what Kubernetes provided for the previous era, while also recognizing that this is a very different workload,” she added.

At its core, llm‑d addresses the most challenging aspects of LLM inference, including KV‑cache locality management, balancing prefill and decode phases, coordinating multi‑node deployments, maintaining low latency, and efficiently utilizing heterogeneous accelerator hardware. “To deliver efficient inference, llm‑d introduces intelligent inference scheduling and prefix‑cache‑aware routing,” said Vita Bortnikov, IBM Fellow specializing in distributed AI inferencing at IBM Research. “This ensures that each request is routed to the most optimal replica based on cache state, traffic patterns, and hardware topology.”

“Another key capability of llm‑d is hierarchical KV‑cache offloading across GPU, CPU, and storage tiers,” she added. “This significantly improves performance, particularly for long‑context workloads and high levels of concurrency.”
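To make the idea of prefix-cache-aware routing concrete, here is a minimal Python sketch. It is not llm-d's scheduler: the `Replica` class, the block size, the `load_weight` penalty, and the scoring rule are all assumptions chosen for illustration. It only shows the general principle of preferring a replica that already holds the longest cached prefix of a request, while penalizing replicas with deep queues.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical router-side view of one model-server replica."""
    name: str
    queue_depth: int = 0                                # requests currently waiting
    cached_prefixes: set = field(default_factory=set)   # hashes of cached token blocks


BLOCK_SIZE = 4  # tokens per cache block; an assumed value for illustration


def block_hashes(tokens):
    """Hash each full block together with all tokens before it (prefix chaining)."""
    hashes, running = [], ()
    full_length = len(tokens) - len(tokens) % BLOCK_SIZE
    for start in range(0, full_length, BLOCK_SIZE):
        running = running + tuple(tokens[start:start + BLOCK_SIZE])
        hashes.append(hash(running))
    return hashes


def cached_prefix_blocks(replica, hashes):
    """Count the leading blocks of a request that the replica already holds."""
    count = 0
    for h in hashes:
        if h not in replica.cached_prefixes:
            break
        count += 1
    return count


def pick_replica(replicas, tokens, load_weight=0.5):
    """Pick the replica with the best (cache hits minus load penalty) score."""
    hashes = block_hashes(tokens)

    def score(replica):
        return cached_prefix_blocks(replica, hashes) - load_weight * replica.queue_depth

    best = max(replicas, key=score)
    best.cached_prefixes.update(hashes)  # the chosen replica now caches this prefix
    best.queue_depth += 1
    return best
```

In this toy model, two requests that share a long system prompt tend to land on the same replica until its queue grows deep enough for the load penalty to outweigh the cache benefit. A production router would also weigh signals such as hardware topology, which this sketch leaves out.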

Through prefill/decode disaggregation, llm‑d allows these two fundamentally different phases of inference to scale independently, dramatically improving efficiency for variable workloads. Autoscaling is traffic‑ and hardware‑aware, adapting to real‑time workload characteristics rather than relying on generic CPU and GPU metrics. llm‑d integrates deeply with emerging Kubernetes standards, including the Kubernetes Gateway API Inference Extension (GAIE) and LeaderWorkerSet (LWS), making distributed inference a first‑class Kubernetes workload.
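As a rough illustration of why disaggregation helps, the sketch below sizes a prefill pool and a decode pool independently from token traffic rather than from generic utilization metrics. The `Traffic` fields, the per-replica capacities, and the 80% headroom target are hypothetical values, not llm-d parameters or measured figures.

```python
import math
from dataclasses import dataclass


@dataclass
class Traffic:
    """Hypothetical summary of the current workload."""
    requests_per_sec: float
    avg_prompt_tokens: int   # tokens processed during prefill
    avg_output_tokens: int   # tokens generated during decode


def replicas_needed(tokens_per_sec, per_replica_tokens_per_sec, headroom=0.8):
    """Replicas required to serve a token rate while staying under `headroom` utilization."""
    if tokens_per_sec <= 0:
        return 1  # keep one replica warm
    usable = per_replica_tokens_per_sec * headroom
    return max(1, math.ceil(tokens_per_sec / usable))


def size_pools(traffic, prefill_capacity, decode_capacity):
    """Size the prefill and decode pools separately from their own token rates."""
    prefill_rate = traffic.requests_per_sec * traffic.avg_prompt_tokens
    decode_rate = traffic.requests_per_sec * traffic.avg_output_tokens
    return {
        "prefill": replicas_needed(prefill_rate, prefill_capacity),
        "decode": replicas_needed(decode_rate, decode_capacity),
    }
```

Calling `size_pools(Traffic(10, 7000, 50), prefill_capacity=20000, decode_capacity=1000)` models a long-prompt, short-answer workload, which needs several prefill replicas but only one decode replica. A chat-style workload with short prompts and long answers shifts the balance the other way, and a single combined pool cannot track both cases without over-provisioning one phase.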

A central promise of llm-d is that it will turn AI infrastructure from a black box into a replicable blueprint for manageable, cloud-native microservices. “This is a well-lit path,” Costa said. “We tested this for you. We benchmarked it. We went through the pain, and this is a path that we provide the community a clear path from experimentation to production.” With reproducible benchmarks, validated deployment patterns, and vendor‑neutral design, llm‑d provides a well‑lit path for organizations seeking production‑grade generative AI infrastructure across NVIDIA, AMD, Intel, and Google TPU accelerators.

A well-lit path

The CNCF is a natural venue for approaching this varied landscape. llm‑d was contributed to the CNCF as a sandbox project to accelerate the standardization, openness, and interoperability of distributed LLM inference across the cloud‑native ecosystem. As organizations race to operationalize generative AI, they’re discovering that LLM inference introduces challenges — stateful scheduling, KV cache locality, multi‑phase execution, heterogeneous accelerators — that expose limitations in the original workload model Kubernetes was designed around. These challenges are too fundamental and too shared to be solved inside a single company’s product roadmap. They require a neutral, community‑driven approach.

By contributing llm‑d to the CNCF, the project’s maintainers aim to establish a vendor‑agnostic, Kubernetes‑native blueprint for high‑performance inference that any organization can adopt. CNCF provides the governance model, IP clarity, and community trust needed for llm‑d to evolve from a promising framework into a widely accepted standard. While IBM, Red Hat, and Google are driving core contributions and early adoption, a growing ecosystem of collaborators is actively exploring integrations with the stack. CNCF stewardship ensures that no single vendor controls the project’s direction and that it remains aligned with upstream Kubernetes APIs such as the GAIE and LWS.

Joining the CNCF also strengthens llm‑d’s mission to create well‑lit paths for production‑grade AI infrastructure. The foundation’s ecosystem provides the ideal environment for building interoperable, standards‑driven components. Ultimately, contributing llm‑d to the CNCF is about ensuring that scalable, efficient, and portable LLM inference becomes a core capability of the cloud‑native stack, not a proprietary feature locked behind closed platforms.

What’s next?

Following the announcement of llm-d’s contribution to the CNCF, the project’s next phase will focus on deepening adoption, expanding technical capabilities, and strengthening its position as the neutral, open-governance inference stack for the AI ecosystem. The donation formalizes llm-d as a community project that will grow as more collaborators join, Costa said.

A key next step is collaborating to support next-generation AI architectures. For instance, Mistral AI is currently contributing features to the llm-d ecosystem to help advance open standards around disaggregated serving. “Creating a common foundation stack has already proven its value,” said Costa. “It allows the entire ecosystem to focus on pushing the boundaries of the AI platform rather than rebuilding the basic building blocks.”

At the same time, IBM Research will continue driving innovation, especially in areas where the industry lacks proven solutions. This includes work on reinforcement learning, which sits at the intersection of inference and training, as well as advancing self‑managing, AI‑guided optimization across caching, scaling, and configuration. The boundary between large-scale inference and model adaptation jobs is blurring, and an inference platform needs to adapt to this reality.

As the project matures, the broader community is actively tackling the next generation of AI infrastructure challenges. The technical roadmap introduces standardized support for multi-modal workloads, expands integration to additional inference engines, and optimizes scheduling for multi-LoRA environments alongside advanced multi-tier KV cache offloading, ensuring llm-d meets the ecosystem’s evolving table-stakes expectations while pushing into new frontiers. Together, these steps position llm-d to evolve rapidly under CNCF governance and accelerate its role as the operating layer for distributed inference.