Canonical welcomes NVIDIA’s donation of the GPU DRA driver to CNCF
Abdelrahman Hosny · 2026-03-24 · via Ubuntu blog

At KubeCon Europe in Amsterdam, NVIDIA announced that it will donate the GPU Dynamic Resource Allocation (DRA) Driver to the Cloud Native Computing Foundation (CNCF). This marks an important milestone for the Kubernetes ecosystem and for the future of AI infrastructure.

For years, GPUs have been central to modern machine learning and high-performance computing workloads, yet integrating them into Kubernetes has required specialized tooling and vendor-specific components. The donation of the DRA driver represents a shift toward deeper standardization of GPU orchestration in cloud-native environments. By bringing this technology into the CNCF ecosystem, NVIDIA is helping ensure that advanced GPU scheduling capabilities evolve in the open, alongside the broader Kubernetes community.

This contribution strengthens Kubernetes as the platform for large-scale AI workloads and provides a foundation for more flexible, programmable GPU resource management. To understand why this matters, it helps to look at the broader NVIDIA GPU ecosystem that powers AI workloads on Kubernetes.

The NVIDIA GPU ecosystem for Kubernetes

As of 2026, the NVIDIA GPU stack in Kubernetes is organized into three major layers: the GPU Operator, the Modern Resource Stack built around DRA, and advanced orchestration capabilities such as the Kubernetes AI (KAI) Scheduler. Together, these components transform GPUs from simple hardware accelerators into fully orchestrated infrastructure resources.

The GPU Operator: automating GPU infrastructure

The NVIDIA GPU Operator automates the lifecycle management of the software required for GPUs to function inside a Kubernetes cluster. Instead of requiring administrators to manually configure drivers, runtimes, and monitoring tools, the operator deploys and manages these components automatically. This provides a consistent, production-ready environment for GPU workloads.

Typical components deployed by the operator include:

  • NVIDIA Driver: The kernel modules and userspace libraries required for GPU operation are installed through a containerized driver manager.
  • NVIDIA Container Toolkit: This component integrates GPUs with container runtimes such as containerd or CRI-O, allowing containers to access GPU hardware and CUDA libraries on the node.
  • GPU Access Layer: Clusters traditionally used the NVIDIA device plugin to request GPUs as simple integer counts. With the introduction of the DRA driver, clusters can adopt the new Kubernetes-native resource model instead. The GPU Operator will install and manage the DRA driver for GPUs in an upcoming release. The device plugin and the DRA driver are, and will remain, mutually exclusive within a cluster.
  • DCGM Exporter: Exports telemetry such as power usage, temperature, and utilization metrics to Prometheus for monitoring.
  • GPU Feature Discovery (GFD): Automatically labels Kubernetes nodes with GPU capabilities, such as memory size or CUDA support.
  • NVIDIA MIG Manager: Allows modern GPUs such as NVIDIA H100, NVIDIA H200, and NVIDIA Blackwell to be partitioned into multiple logical GPU instances using Multi-Instance GPU (MIG) technology.

The GPU Operator therefore acts as the operational backbone of GPU infrastructure in Kubernetes clusters.
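As a rough illustration of how the labels published by GPU Feature Discovery can be used, the sketch below filters a small, invented node inventory by GPU memory. The label keys follow the nvidia.com/gpu.* naming convention; the node names and label values are hypothetical, and a real cluster would read labels from the Kubernetes API rather than from a hard-coded dictionary.

```python
from typing import Dict, List

# Hypothetical inventory: node name -> node labels.
# Label keys follow the nvidia.com/gpu.* convention used by GPU Feature
# Discovery; the values here are invented for illustration.
NODES: Dict[str, Dict[str, str]] = {
    "node-a": {
        "nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3",
        "nvidia.com/gpu.memory": "81559",  # MiB
        "nvidia.com/gpu.count": "8",
    },
    "node-b": {
        "nvidia.com/gpu.product": "NVIDIA-A10",
        "nvidia.com/gpu.memory": "23028",  # MiB
        "nvidia.com/gpu.count": "1",
    },
    "node-c": {},  # CPU-only node: no GPU labels at all
}


def nodes_with_min_gpu_memory(
    nodes: Dict[str, Dict[str, str]], min_mib: int
) -> List[str]:
    """Return the names of nodes whose GPUs report at least min_mib of memory."""
    matches = []
    for name, labels in nodes.items():
        raw = labels.get("nvidia.com/gpu.memory")
        if raw is not None and int(raw) >= min_mib:
            matches.append(name)
    return sorted(matches)


if __name__ == "__main__":
    print(nodes_with_min_gpu_memory(NODES, 40_000))
```

In practice the same selection is usually expressed declaratively, for example with a nodeSelector or node affinity rule on the workload, rather than in application code.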

The DRA driver: a modern resource model for GPUs

The DRA driver represents the next generation of GPU resource management for Kubernetes. Historically, Kubernetes treated GPUs as simple integer resources: a workload would request something like nvidia.com/gpu: 1. While effective for simple cases, this model lacks the expressiveness needed for modern AI workloads.

DRA introduces a richer model based on ResourceClaims, enabling applications to request very specific hardware capabilities rather than just a count of GPUs.  

Examples include:

  • Requesting GPUs connected through NVIDIA NVLink
  • Requesting a specific GPU slice
  • Allocating GPUs across nodes that share memory domains

This level of control becomes essential for modern training workloads, which often rely on tightly coupled GPU communication.
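To make the contrast between the two models concrete, the following sketch builds both kinds of manifest as plain Python dictionaries. The field layout for the claim follows the resource.k8s.io/v1 API as generally documented for recent Kubernetes releases; older clusters use beta API versions with a slightly different shape, so treat this as an assumption to verify against your cluster. The device class name gpu.nvidia.com, the image name, and the helper function names are illustrative assumptions, not taken from the article.

```python
import json
from typing import Any, Dict

Manifest = Dict[str, Any]


def legacy_pod(name: str, image: str, gpus: int) -> Manifest:
    """Pod that asks for GPUs the traditional way: an opaque integer count."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "main",
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


def resource_claim(name: str, device_class: str) -> Manifest:
    """ResourceClaim asking for one device from the given DeviceClass.

    Assumes the resource.k8s.io/v1 schema; check the API version your
    cluster actually serves before using this shape.
    """
    return {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "ResourceClaim",
        "metadata": {"name": name},
        "spec": {
            "devices": {
                "requests": [{
                    "name": "gpu",
                    "exactly": {"deviceClassName": device_class},
                }],
            },
        },
    }


def dra_pod(name: str, image: str, claim_name: str) -> Manifest:
    """Pod that consumes a ResourceClaim instead of an integer GPU count."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pod-level reference to the claim object...
            "resourceClaims": [{"name": "gpu", "resourceClaimName": claim_name}],
            "containers": [{
                "name": "main",
                "image": image,
                # ...and the container opts in to it by name.
                "resources": {"claims": [{"name": "gpu"}]},
            }],
        },
    }


if __name__ == "__main__":
    image = "registry.example/trainer:latest"  # hypothetical image
    for manifest in (
        legacy_pod("legacy", image, 1),
        resource_claim("single-gpu", "gpu.nvidia.com"),  # assumed class name
        dra_pod("with-claim", image, "single-gpu"),
    ):
        print(json.dumps(manifest, indent=2))
```

The practical difference is that the claim is a separate object: selection criteria beyond a plain count, such as a particular slice or interconnect, can be attached to the request without changing how the pod references it.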

DRA also introduces several important capabilities:

  • ComputeDomains: This abstraction enables multi-node NVIDIA NVLink communication. On systems such as GB200, workloads spanning multiple nodes can behave as if they were running on a single massive GPU.
  • Container Device Interface (CDI): Instead of relying on environment variables such as NVIDIA_VISIBLE_DEVICES, CDI injects devices into containers through a standardized interface, improving reliability and portability. 

With the DRA driver moving to the CNCF, these capabilities become part of a broader open ecosystem for accelerator orchestration.
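For readers unfamiliar with CDI, devices are referred to by a fully qualified name of the form vendor/class=name, for example nvidia.com/gpu=0. The sketch below parses such a name into its parts. It is a simplified illustration: the function and type names are hypothetical, and it does not enforce the character-set rules of the CDI specification.

```python
from typing import NamedTuple


class CDIDevice(NamedTuple):
    vendor: str        # e.g. "nvidia.com"
    device_class: str  # e.g. "gpu"
    name: str          # e.g. "0" or "all"


def parse_cdi_name(qualified: str) -> CDIDevice:
    """Split a fully qualified CDI device name of the form vendor/class=name.

    Simplified: only checks that all three parts are present.
    """
    kind, eq, name = qualified.partition("=")
    if not eq or not name:
        raise ValueError(f"missing '=name' part in {qualified!r}")
    vendor, slash, device_class = kind.partition("/")
    if not slash or not vendor or not device_class:
        raise ValueError(f"expected 'vendor/class' before '=' in {qualified!r}")
    return CDIDevice(vendor, device_class, name)


if __name__ == "__main__":
    print(parse_cdi_name("nvidia.com/gpu=0"))
```

Because the device is named explicitly in a structured form, the runtime can resolve it against a CDI specification file instead of relying on an environment variable that an image or user could set inconsistently.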

The KAI Scheduler: AI-aware scheduling

Running AI workloads efficiently requires more than just allocating GPUs. It requires scheduling decisions that understand how AI jobs behave. The KAI Scheduler adds a layer of intelligence to Kubernetes scheduling, building on the GPU Operator and the DRA driver to enable more advanced resource coordination.

Key capabilities include:

  • Fractional GPU allocation: Multiple workloads can share a GPU using memory partitioning or time slicing.
  • Hierarchical queuing: Teams can be assigned GPU quotas, and the scheduler manages fairness and prioritization within those quotas.
  • Gang scheduling for distributed training: Large training jobs often require dozens or hundreds of GPUs simultaneously. KAI ensures these jobs start only when all required resources are available, preventing partially allocated jobs from holding GPUs idle while they wait for the rest.

These capabilities are critical for organizations running large-scale training pipelines or shared AI platforms.
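The idea behind gang scheduling combined with team quotas can be sketched in a few lines. The toy model below is not the KAI Scheduler's algorithm: it uses flat per-team quotas rather than a queue hierarchy, considers jobs in submission order, and ignores priorities, preemption, and topology. All names and numbers are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    name: str
    team: str
    gpus_needed: int


def gang_admit(jobs: List[Job], free_gpus: int, quotas: Dict[str, int]) -> List[str]:
    """Admit jobs all-or-nothing, in order, within per-team GPU quotas.

    A job is admitted only if every GPU it needs is free right now and the
    owning team stays within its quota. Otherwise it receives nothing, so it
    never holds a partial allocation.
    """
    used = {team: 0 for team in quotas}
    admitted = []
    for job in jobs:
        team_used = used.get(job.team, 0)
        within_quota = team_used + job.gpus_needed <= quotas.get(job.team, 0)
        fits = job.gpus_needed <= free_gpus
        if within_quota and fits:
            free_gpus -= job.gpus_needed
            used[job.team] = team_used + job.gpus_needed
            admitted.append(job.name)
    return admitted


if __name__ == "__main__":
    pending = [
        Job("train-a", "research", 8),
        Job("train-b", "research", 16),  # would exceed both quota and capacity
        Job("infer-c", "prod", 2),
    ]
    print(gang_admit(pending, free_gpus=12, quotas={"research": 16, "prod": 4}))
```

With 12 free GPUs, train-a and infer-c are admitted while train-b waits in full. A scheduler without gang semantics might instead hand train-b the 4 remaining GPUs, which it could not use until the other 12 arrived.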

Why the CNCF donation matters

The donation of the DRA driver to the CNCF represents a significant step toward making advanced GPU orchestration a first-class citizen of the Kubernetes ecosystem. It accelerates the adoption of Kubernetes-native resource models for GPUs, encourages community-driven innovation, and strengthens the foundation for large-scale AI workloads. As AI infrastructure becomes increasingly central to modern platforms, open collaboration around core technologies like GPU scheduling and resource allocation will play a key role in shaping the next generation of cloud-native systems.

Canonical Kubernetes: a platform for cloud-native AI infrastructure

Running modern AI workloads requires more than GPUs and schedulers. It requires a Kubernetes platform that is secure, easy to operate, and capable of supporting large-scale, hardware-accelerated workloads.

Canonical provides a Kubernetes distribution designed to deliver exactly that. Canonical Kubernetes is a lightweight, secure, and opinionated Kubernetes distribution that includes all the components required to deploy and operate a production-ready cluster. It bundles the essential services needed for Kubernetes clusters, including the container runtime, networking (CNI), DNS, ingress, and other operational components, so that teams can deploy and manage clusters with minimal operational overhead.  

By building directly on upstream Kubernetes, Canonical Kubernetes maintains compatibility with the broader cloud-native ecosystem while simplifying lifecycle management. Security updates and upstream Kubernetes releases are delivered in a streamlined way, allowing teams to stay current without the operational complexity typically associated with cluster maintenance. Canonical Kubernetes is designed to support deployments across a wide range of environments, from small clusters used for experimentation to large enterprise deployments operating across multiple regions. The platform integrates naturally with Canonical’s broader open infrastructure stack and benefits from the reliability and security of Ubuntu.

For organizations running AI workloads, this provides a stable foundation on which the NVIDIA GPU ecosystem can operate. Components such as the GPU Operator, the DRA driver, and advanced schedulers can be deployed on top of Canonical Kubernetes to enable GPU-accelerated machine learning pipelines, distributed training clusters, and scalable inference platforms.

Together, Canonical Kubernetes and the evolving NVIDIA AI infrastructure ecosystem provide the building blocks needed to run modern AI infrastructure using open, cloud-native technologies.
