惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

美团技术团队
D
DataBreaches.Net
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
D
Docker
N
Netflix TechBlog - Medium
Cyber Security Advisories - MS-ISAC
Cyber Security Advisories - MS-ISAC
C
Check Point Blog
腾讯CDC
Stack Overflow Blog
Stack Overflow Blog
V
Visual Studio Blog
IT之家
IT之家
月光博客
月光博客
U
Unit 42
K
Kaspersky official blog
T
Threatpost
cs.AI updates on arXiv.org
cs.AI updates on arXiv.org
GbyAI
GbyAI
P
Proofpoint News Feed
Last Week in AI
Last Week in AI
云风的 BLOG
云风的 BLOG
酷 壳 – CoolShell
酷 壳 – CoolShell
I
InfoQ
Engineering at Meta
Engineering at Meta
Recorded Future
Recorded Future
Exploit-DB.com RSS Feed
Exploit-DB.com RSS Feed
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
S
Security @ Cisco Blogs
MyScale Blog
MyScale Blog
大猫的无限游戏
大猫的无限游戏
Security Archives - TechRepublic
Security Archives - TechRepublic
Webroot Blog
Webroot Blog
cs.CV updates on arXiv.org
cs.CV updates on arXiv.org
Hacker News - Newest:
Hacker News - Newest: "LLM"
S
Schneier on Security
S
Secure Thoughts
The Register - Security
The Register - Security
B
Blog RSS Feed
The Last Watchdog
The Last Watchdog
P
Palo Alto Networks Blog
爱范儿
爱范儿
B
Blog
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
N
News and Events Feed by Topic
阮一峰的网络日志
阮一峰的网络日志
L
LINUX DO - 热门话题
C
Cisco Blogs
Spread Privacy
Spread Privacy
F
Full Disclosure
博客园 - 聂微东
T
The Blog of Author Tim Ferriss

Cloud Native Computing Foundation

Kepler, re-architected: Improved power accuracy and a community call to action! Dragonfly v2.5.0 is released OTel and mesh-derived metrics: A 2026 reference etcd-operator joins Cozystack with a new v1alpha2 API Security Profiles Operator v1: Stable APIs, Security Hardened, and Shaping Upstream Kubernetes Securing CI/CD for an open source project, part 3: Credentials, verification, and what’s next Building a Cluster-Aware AI Agent with Kubernetes, Argo CD, and GitOps From Awareness to Engineered Accessibility in Open Source Agent Auth: A lawyer’s day in court Building Jaeger’s ClickHouse backend: 8.6× compression on 10 million spans Telemetry that matters: Designing sustainable, high-impact observability pipelines KubeCon + CloudNativeCon, OpenInfra Summit and PyTorch Conference Unite in China to Scale AI Flipkart Wins CNCF End User Case Study Contest for Kubernetes and Chaos Engineering Scale Expanding CARE: Passing CKS can now extend your CKA certification CNCF and Linux Foundation Education Partner with Udemy to Provide a Unified Cloud Native Training & Certification Opportunity CNCF and SlashData Report Confirms India as One of the Largest Cloud Native Communities with 2.25 Million Developers CNCF Welcomes New Silver Members as Global Demand for Cloud Native Infrastructure Grows Client Challenge Client Challenge Client Challenge Client Challenge Client Challenge Client Challenge Client Challenge Client Challenge Client Challenge Client Challenge Client Challenge Client Challenge Client Challenge Building a cloud native internal developer platform with Kubernetes, GitOps, and supply chain security The Kubernetes integration tax: Prometheus, Cilium and production reality GPU autoscaling on Kubernetes with KEDA: Building an external scaler Three TAG leads walk into the TOC How Jaeger is evolving to trace AI agents with OpenTelemetry Why Kubernetes policy enforcement happens too late—and what to do about it Zero-Downtime migration from ingress NGINX to Envoy Gateway Client Challenge Client Challenge Client Challenge
Why cloud native belongs at the heart of agentic AI: Lessons from building a multi-agent security platform on Kubernetes
epower · 2026-06-17 · via Cloud Native Computing Foundation

Posted on June 17, 2026

In March, I gave a talk at KubeCon + CloudNativeCon Europe 2026 in Amsterdam. After the session, the same questions kept coming up on the CNCF Slack and in person: why build agentic AI on cloud native foundations at all? Which CNCF projects actually do the heavy lifting? Where does the human sit, and how do you organise the teams around it? What follows is the short answer, drawn from a system we are currently developing and rolling out at Orange Innovation.

The context: an internal real-time security-operations platform protecting a regulated production environment, currently in active development with a rollout underway. A2A protocol for inter-agent coordination (open-sourced in 2025, now under the Linux Foundation). MCP for environment integration (hosted under the Agentic AI Foundation, an LF project). Falco with eBPF intercepts every syscall on the workloads we monitor; events flow through Kafka into an Isolation Forest classical anomaly model that pre-filters in front of the LLM-driven agents. The goal is to materially shorten mean time to detect and respond, and to offload rule authorship from human analysts to the agent layer. The lessons below are written for that context, but apply equally to any cloud native estate where a SOC team and a platform team share responsibility for what runs in production.

Figure 1: System overview. A Coordinator Agent (LangGraph + A2A) orchestrates four specialised agents: Detect (Falco + ML), Analyse (Threat Analyst), Remediate, and Notify (Mattermost), plus a Human-in-the-Loop branch and a feedback loop that retrains the anomaly model. Adapted from my talk at KubeCon + CloudNativeCon Europe 2026.

Figure 1: System overview. A Coordinator Agent (LangGraph + A2A) orchestrates four specialised agents: Detect (Falco + ML), Analyse (Threat Analyst), Remediate, and Notify (Mattermost), plus a Human-in-the-Loop branch and a feedback loop that retrains the anomaly model. Adapted from my talk at KubeCon + CloudNativeCon Europe 2026.

Below are five technical lessons that have held up so far; how we organized the work across teams and the community, and a closing note on why the CNCF and LF stack is the right substrate for this kind of system.

1. Each agent is a Kubernetes workload, not an in-process module

We deploy each agent as its own Deployment, with its own resource limits, identity, and restart policy. Most agents run LangGraph internally for their tool-use and reasoning loop; a few are hand-written without a framework where we need tighter control (see “Why this stack” below). The agent layer behaves like any microservice mesh: canary rollouts, HPA, namespace isolation apply without invention. The opposite pattern (all agents in one process) is faster to demo on a laptop and would be wrong in production. One agent stuck on a model-API timeout drags the rest down.

2. Inter-agent traffic needs mTLS, not a service mesh

A2A messages carry proposed detection rules and response actions; the threat model says they are at least as sensitive as the data plane.

We do not run a service mesh. cert-manager issues per-agent identities; agents perform mTLS directly at their gRPC/HTTP transport, with no sidecar. Cilium provides the network substrate and CiliumNetworkPolicy restricts which agent identities may reach which MCP server. The combination (cert-manager + agent-level mTLS + CiliumNetworkPolicy) is materially simpler than a mesh and gives us what a mesh would have given us.

Of all our choices so far, A2A is the one I would make again without hesitation. Open-sourced in 2025, governed under the Linux Foundation, not bound to a single framework, so operators can plan a 3-to-5-year deployment around it. Pairing A2A (LF) with the CNCF stack (LF) puts the whole substrate under one open governance umbrella, which in regulated industries is a procurement argument as much as a technical one.

3. Agent safety constraints are policy-as-code, not LLM prompt reasoning

In our architecture, a reviewer agent decides whether a proposed action is safe to execute. That could be a detection rule deployment, a containment, a firewall change. The instinct is to put safety constraints in the reviewer’s system prompt. Don’t.

Upstream of the reviewer, a threat-analyst agent classifies each escalation against the MITRE ATT&CK framework, so the reviewer gets a structured input rather than free-form LLM prose. We codified the reviewer’s constraints as OPA policies and Kyverno admission rules. The reviewer calls into OPA via MCP, gets a deterministic verdict, and acts. The reviewer’s prompt itself is short and, frankly, boring. The policy is version-controlled, unit-tested, and code-reviewed like any other artefact. If there is one change to make before anything else, it is this one.

4. Observability rides the A2A trace_id, GitOps owns the configuration

The A2A envelope carries a trace_id with every task; that trace_id is what holds our observability together. Agents emit structured JSON logs with the trace_id, the agent identity, the MCP calls made, and the LLM token usage. Prometheus scrapes per-agent metrics (request rates, MCP-call latency, reviewer auto-execute / auto-reject / escalate ratios). Cilium Hubble gives the flow view when the question is “did the right pod reach the right service”.

The first time we walked an internal stakeholder through a specific automated decision during development, we pulled every entry for that trace_id and lined them up. The whole reasoning chain walked through in about fifteen minutes. Without trace_id propagation through A2A, it would have been a day.

Every agent’s system prompt, tool list, and output schema is a Kubernetes Custom Resource, reconciled by Argo CD from Git. The reviewer’s policy bundle lives in the same repo. Promoting a change is a pull request: code-reviewed, audited, reversible. This is where most early multi-agent deployments fall over: prompts scattered across notebooks and config files until the day an agent surprises someone.

5. Gate the LLM with a classical anomaly model

If every event triggered the full agent fan-out, the LLM tier would dominate the platform’s economics. A scikit-learn Isolation Forest sits in front of the agents, scoring each sample in microseconds on 17 features; only samples above a calibrated threshold reach the agent fan-out. The LLM is invoked on the small fraction that looks genuinely novel: exactly the slice of work a human detection engineer used to do. Per-event latency and token cost both stay bounded, and sizing the LLM tier becomes normal capacity planning.

The Isolation Forest retrains weekly by design, and feature-distribution drift is itself a paged Prometheus metric. The anomaly threshold is not a static constant. It is a policy parameter the reviewer consults at decision time. We can tighten it under load without redeploying agents.

Keep the human in the loop, by protocol, not by culture

Every consequential decision has three terminal states. Auto-execute. Auto-reject. Or escalate to a human SOC analyst on Mattermost, with the full reasoning chain attached and ChatOps commands to approve, dismiss, or investigate inline. The third is not an error path; it is a normal output of the reviewer. It is designed to fire in three cases: reviewer confidence falls below threshold; the asset is on an always-escalate list (control-plane components, identity stores, anything customer-facing or compliance-sensitive); or the proposed action would exceed a configured blast radius.

“Should this case escalate?” is a deterministic policy verdict, version-controlled in Git, with its own SLO and dashboards. It is not a question of which analyst is on shift. If your HITL story is “we’ll add an approval step later”, or “the analyst can always intervene”, you don’t really have one yet.

How development and rollout actually go

As we move from development into rollout, the operational model already looks like any Kubernetes platform we have run before. Alerts are structural: policy bundle failed admission, MCP server p99 latency, anomaly-pre-filter drift, A2A queue depth above watermark. Not “agent X gave a weird answer”. When an agent regresses during iteration, we treat it like any production microservice regression: roll back the Custom Resource via Argo CD, open a ticket, ship a fix through GitOps. No special agent-incident runbook to invent, and that is the point.

What changes for the SOC team is the nature of their work. Rule authorship has been the structural bottleneck for years; offloading it to the agent layer is the explicit goal. Engineers will curate the reviewer’s safety policy and spot-audit deployed rules instead of writing them. The day-to-day artefacts (CRDs, policies, GitOps pull requests) are ones the SOC and platform teams already know how to handle together.

How the work is organised across teams and the community

None of this works without joined-up teams. Three groups touch this system every week: the SOC, who own detection outcomes and the reviewer’s safety policy; the platform team, who own the cluster, GitOps pipeline, and agent runtime; and a small AI engineering group, who own the agent contracts and the anomaly model. We deliberately kept the contracts between them narrow and machine-readable (CRDs, OPA bundles, A2A schemas), so a change in one area never depends on a meeting in another.

The operational gain we are after is not just speed; it is capacity. Scaling detection coverage used to mean hiring more analysts to write more rules. With the agent layer, it means deploying more agent replicas and tightening the reviewer’s policy bundle, a meaningful lever on what was a headcount-bound problem, and time back for analysts on cases that genuinely need a human.

Externally, the system also exists in a community. The CNCF Landscape and the maturity signals attached to it (Sandbox, Incubating, Graduated, plus adoption and governance data) actively shape our technical choices: when we evaluated network policy enforcement, identity issuance, or anomaly tooling, the Landscape gave us a vendor-neutral starting point and the project maturity told us what we could responsibly run in a regulated production environment. The same lens decides where we contribute back. We track upstream issues in the A2A and MCP repositories, file what we hit, and feed lessons back into CNCF working groups. KubeCon talks and CNCF Slack threads are part of the loop, not afterthoughts. Picking cloud native and LF-governed protocols means we are not the only ones improving the substrate.

Why this stack

If any of this looks tractable on paper, it is because the CNCF and broader Linux Foundation projects we built on are simply that good. They let us treat agentic AI as a normal cloud native workload rather than a special case. Kubernetes makes deployment boring in the best way. Falco gives us a syscall-level detection substrate we did not have to write. Cilium and Hubble take identity-aware network policy seriously. cert-manager turns per-agent mTLS into a configuration. OPA and Kyverno make policy-as-code the default. Argo CD makes GitOps for agent CRDs a one-day implementation. Prometheus is the metrics layer the cloud native world runs on.

On the agentic-AI side, AAIF gives MCP a neutral home and A2A is governed under the Linux Foundation. LangGraph is the agent runtime we settled on after trying alternatives, but it is not the only path: frameworks like CrewAI, AutoGen, and LlamaIndex sit in the same space, and for some of our agents we deliberately keep the logic hand-written without any framework at all when we want full control over state machines, retry semantics, and tool-call sequencing. The protocols (A2A, MCP) are what we treat as the durable interfaces; the runtime is a choice we can revisit.

The two questions I keep getting are why cloud native at all for agentic AI, and where the human and the team sit in the loop. Agentic AI inherits all the operational problems cloud native already solved (identity, isolation, policy, observability, GitOps); inventing parallel substrates is wasted motion. And the human path and the team contracts have to be normal outputs of the system, not exceptions bolted on. Find me on the CNCF Slack, or at KubeCon.

About the author

Willem Berroubache is Lead Security Architect at Orange Innovation, where he leads cloud native security architecture for Orange’s 5G core. He is a CNCF Golden Kubestronaut and was selected in 2026 for the Orange Expert Group in Security.  He has spoken at KubeCon + CloudNativeCon Europe 2026 in Amsterdam.