Why cloud native belongs at the heart of agentic AI: Lessons from building a multi-agent security platform on Kubernetes

Posted on June 17, 2026

In March, I gave a talk at KubeCon + CloudNativeCon Europe 2026 in Amsterdam. After the session, the same questions kept coming up on the CNCF Slack and in person: why build agentic AI on cloud native foundations at all? Which CNCF projects actually do the heavy lifting? Where does the human sit, and how do you organise the teams around it? What follows is the short answer, drawn from a system we are currently developing and rolling out at Orange Innovation.

The context: an internal real-time security-operations platform protecting a regulated production environment, currently in active development with a rollout underway. A2A protocol for inter-agent coordination (open-sourced in 2025, now under the Linux Foundation). MCP for environment integration (hosted under the Agentic AI Foundation, an LF project). Falco with eBPF intercepts every syscall on the workloads we monitor; events flow through Kafka into an Isolation Forest classical anomaly model that pre-filters in front of the LLM-driven agents. The goal is to materially shorten mean time to detect and respond, and to offload rule authorship from human analysts to the agent layer. The lessons below are written for that context, but apply equally to any cloud native estate where a SOC team and a platform team share responsibility for what runs in production.

Figure 1: System overview. A Coordinator Agent (LangGraph + A2A) orchestrates four specialised agents: Detect (Falco + ML), Analyse (Threat Analyst), Remediate, and Notify (Mattermost), plus a Human-in-the-Loop branch and a feedback loop that retrains the anomaly model. Adapted from my talk at KubeCon + CloudNativeCon Europe 2026.

Below are five technical lessons that have held up so far; how we organized the work across teams and the community, and a closing note on why the CNCF and LF stack is the right substrate for this kind of system.

1. Each agent is a Kubernetes workload, not an in-process module

We deploy each agent as its own Deployment, with its own resource limits, identity, and restart policy. Most agents run LangGraph internally for their tool-use and reasoning loop; a few are hand-written without a framework where we need tighter control (see “Why this stack” below). The agent layer behaves like any microservice mesh: canary rollouts, HPA, namespace isolation apply without invention. The opposite pattern (all agents in one process) is faster to demo on a laptop and would be wrong in production. One agent stuck on a model-API timeout drags the rest down.

2. Inter-agent traffic needs mTLS, not a service mesh

A2A messages carry proposed detection rules and response actions; the threat model says they are at least as sensitive as the data plane.

We do not run a service mesh. cert-manager issues per-agent identities; agents perform mTLS directly at their gRPC/HTTP transport, with no sidecar. Cilium provides the network substrate and CiliumNetworkPolicy restricts which agent identities may reach which MCP server. The combination (cert-manager + agent-level mTLS + CiliumNetworkPolicy) is materially simpler than a mesh and gives us what a mesh would have given us.

Of all our choices so far, A2A is the one I would make again without hesitation. Open-sourced in 2025, governed under the Linux Foundation, not bound to a single framework, so operators can plan a 3-to-5-year deployment around it. Pairing A2A (LF) with the CNCF stack (LF) puts the whole substrate under one open governance umbrella, which in regulated industries is a procurement argument as much as a technical one.

3. Agent safety constraints are policy-as-code, not LLM prompt reasoning

In our architecture, a reviewer agent decides whether a proposed action is safe to execute. That could be a detection rule deployment, a containment, a firewall change. The instinct is to put safety constraints in the reviewer’s system prompt. Don’t.

Upstream of the reviewer, a threat-analyst agent classifies each escalation against the MITRE ATT&CK framework, so the reviewer gets a structured input rather than free-form LLM prose. We codified the reviewer’s constraints as OPA policies and Kyverno admission rules. The reviewer calls into OPA via MCP, gets a deterministic verdict, and acts. The reviewer’s prompt itself is short and, frankly, boring. The policy is version-controlled, unit-tested, and code-reviewed like any other artefact. If there is one change to make before anything else, it is this one.

4. Observability rides the A2A trace_id, GitOps owns the configuration

The A2A envelope carries a trace_id with every task; that trace_id is what holds our observability together. Agents emit structured JSON logs with the trace_id, the agent identity, the MCP calls made, and the LLM token usage. Prometheus scrapes per-agent metrics (request rates, MCP-call latency, reviewer auto-execute / auto-reject / escalate ratios). Cilium Hubble gives the flow view when the question is “did the right pod reach the right service”.

The first time we walked an internal stakeholder through a specific automated decision during development, we pulled every entry for that trace_id and lined them up. The whole reasoning chain walked through in about fifteen minutes. Without trace_id propagation through A2A, it would have been a day.

Every agent’s system prompt, tool list, and output schema is a Kubernetes Custom Resource, reconciled by Argo CD from Git. The reviewer’s policy bundle lives in the same repo. Promoting a change is a pull request: code-reviewed, audited, reversible. This is where most early multi-agent deployments fall over: prompts scattered across notebooks and config files until the day an agent surprises someone.

5. Gate the LLM with a classical anomaly model

If every event triggered the full agent fan-out, the LLM tier would dominate the platform’s economics. A scikit-learn Isolation Forest sits in front of the agents, scoring each sample in microseconds on 17 features; only samples above a calibrated threshold reach the agent fan-out. The LLM is invoked on the small fraction that looks genuinely novel: exactly the slice of work a human detection engineer used to do. Per-event latency and token cost both stay bounded, and sizing the LLM tier becomes normal capacity planning.

The Isolation Forest retrains weekly by design, and feature-distribution drift is itself a paged Prometheus metric. The anomaly threshold is not a static constant. It is a policy parameter the reviewer consults at decision time. We can tighten it under load without redeploying agents.

Keep the human in the loop, by protocol, not by culture

Every consequential decision has three terminal states. Auto-execute. Auto-reject. Or escalate to a human SOC analyst on Mattermost, with the full reasoning chain attached and ChatOps commands to approve, dismiss, or investigate inline. The third is not an error path; it is a normal output of the reviewer. It is designed to fire in three cases: reviewer confidence falls below threshold; the asset is on an always-escalate list (control-plane components, identity stores, anything customer-facing or compliance-sensitive); or the proposed action would exceed a configured blast radius.

“Should this case escalate?” is a deterministic policy verdict, version-controlled in Git, with its own SLO and dashboards. It is not a question of which analyst is on shift. If your HITL story is “we’ll add an approval step later”, or “the analyst can always intervene”, you don’t really have one yet.

How development and rollout actually go

As we move from development into rollout, the operational model already looks like any Kubernetes platform we have run before. Alerts are structural: policy bundle failed admission, MCP server p99 latency, anomaly-pre-filter drift, A2A queue depth above watermark. Not “agent X gave a weird answer”. When an agent regresses during iteration, we treat it like any production microservice regression: roll back the Custom Resource via Argo CD, open a ticket, ship a fix through GitOps. No special agent-incident runbook to invent, and that is the point.

What changes for the SOC team is the nature of their work. Rule authorship has been the structural bottleneck for years; offloading it to the agent layer is the explicit goal. Engineers will curate the reviewer’s safety policy and spot-audit deployed rules instead of writing them. The day-to-day artefacts (CRDs, policies, GitOps pull requests) are ones the SOC and platform teams already know how to handle together.

How the work is organised across teams and the community

None of this works without joined-up teams. Three groups touch this system every week: the SOC, who own detection outcomes and the reviewer’s safety policy; the platform team, who own the cluster, GitOps pipeline, and agent runtime; and a small AI engineering group, who own the agent contracts and the anomaly model. We deliberately kept the contracts between them narrow and machine-readable (CRDs, OPA bundles, A2A schemas), so a change in one area never depends on a meeting in another.

The operational gain we are after is not just speed; it is capacity. Scaling detection coverage used to mean hiring more analysts to write more rules. With the agent layer, it means deploying more agent replicas and tightening the reviewer’s policy bundle, a meaningful lever on what was a headcount-bound problem, and time back for analysts on cases that genuinely need a human.

Externally, the system also exists in a community. The CNCF Landscape and the maturity signals attached to it (Sandbox, Incubating, Graduated, plus adoption and governance data) actively shape our technical choices: when we evaluated network policy enforcement, identity issuance, or anomaly tooling, the Landscape gave us a vendor-neutral starting point and the project maturity told us what we could responsibly run in a regulated production environment. The same lens decides where we contribute back. We track upstream issues in the A2A and MCP repositories, file what we hit, and feed lessons back into CNCF working groups. KubeCon talks and CNCF Slack threads are part of the loop, not afterthoughts. Picking cloud native and LF-governed protocols means we are not the only ones improving the substrate.

Why this stack

If any of this looks tractable on paper, it is because the CNCF and broader Linux Foundation projects we built on are simply that good. They let us treat agentic AI as a normal cloud native workload rather than a special case. Kubernetes makes deployment boring in the best way. Falco gives us a syscall-level detection substrate we did not have to write. Cilium and Hubble take identity-aware network policy seriously. cert-manager turns per-agent mTLS into a configuration. OPA and Kyverno make policy-as-code the default. Argo CD makes GitOps for agent CRDs a one-day implementation. Prometheus is the metrics layer the cloud native world runs on.

On the agentic-AI side, AAIF gives MCP a neutral home and A2A is governed under the Linux Foundation. LangGraph is the agent runtime we settled on after trying alternatives, but it is not the only path: frameworks like CrewAI, AutoGen, and LlamaIndex sit in the same space, and for some of our agents we deliberately keep the logic hand-written without any framework at all when we want full control over state machines, retry semantics, and tool-call sequencing. The protocols (A2A, MCP) are what we treat as the durable interfaces; the runtime is a choice we can revisit.

The two questions I keep getting are why cloud native at all for agentic AI, and where the human and the team sit in the loop. Agentic AI inherits all the operational problems cloud native already solved (identity, isolation, policy, observability, GitOps); inventing parallel substrates is wasted motion. And the human path and the team contracts have to be normal outputs of the system, not exceptions bolted on. Find me on the CNCF Slack, or at KubeCon.

About the author

Willem Berroubache is Lead Security Architect at Orange Innovation, where he leads cloud native security architecture for Orange’s 5G core. He is a CNCF Golden Kubestronaut and was selected in 2026 for the Orange Expert Group in Security. He has spoken at KubeCon + CloudNativeCon Europe 2026 in Amsterdam.

推荐订阅源

Cloud Native Computing Foundation