Telemetry that matters: Designing sustainable, high-impact observability pipelines
epower · 2026-06-22 · via Cloud Native Computing Foundation

Posted on June 22, 2026 by Diana Todea - DevRel Engineer at VictoriaMetrics and Cloud Native Days Romania community organizer, Laura Luttmer - Principal Product Manager at Bindplane (Dynatrace), Antonio Jimenez Martinez - Tech Lead Software Engineer at Cisco ThousandEyes

As system architectures grow increasingly complex, the cloud-native community faces a subtle but pressing challenge: we are drowning in our own telemetry data. It is easier than ever to instrument an application and collect signals, but are we actually gaining real insights, or are we just piling up data?

At the recent Observability Summit North America in Minneapolis, a panel of practitioners gathered to dissect this exact problem. This post summarizes the key strategies, shifts, and takeaways discussed during the panel to help engineering teams focus on the telemetry that truly matters.

A photo image showing the welcome screen at Observability Summit North America for 'Telemetry that Matters'.

The core problem: Over-collection and “green” observability

Historically, the baseline strategy for observability was simple: instrument everything and filter it out later. However, industry experience routinely shows that around 50% of collected metrics are never queried or acted upon. This unchecked data collection does more than just bloat storage bills; it introduces steep engineering overhead, increases alert noise, and heightens cognitive load during active incidents.

A critical but frequently overlooked angle of this issue is green observability. Every metric stored, indexed, and processed consumes real compute resources, disk storage, and energy. Reducing telemetry waste isn’t just an infrastructure cost optimization strategy; it also directly reduces the carbon and environmental footprint of our cloud-native platforms.

To build sustainable and highly reliable infrastructure, observability must be treated as a day-zero system design requirement. Teams need to intentionally define what a healthy system looks like and map out exactly which signals are needed to detect structural drift before pushing code to production.

Navigating an incident: From siloed signals to an observability mesh

When a production incident triggers, the goal isn’t to look at everything; it’s to find the data required to quickly assess user impact and localize the root cause. Modern open-standards frameworks like OpenTelemetry organize these data points into core signals:

  • Traces (and Spans): Map the journey of a transaction across distributed services, pointing directly to latency spikes, failures, or broken downstream dependencies.
  • Metrics: Track performance over time (such as CPU consumption or request rates) to flag an anomaly and indicate the scale of impact.
  • Logs: Provide timestamped text records to answer precisely what occurred during a failure event.
  • Profiles: Deliver code-level visibility into resource allocation (like memory and CPU execution hotspots), explaining why a particular service is acting slowly or expensively.

Rather than treating these elements as isolated diagnostic categories, the community is shifting toward an observability mesh. In this interconnected web, metrics point directly to traces, traces embed relevant logs, and logs tie back into resource profiles. During an active incident, this cross-signal connection drastically reduces context-switching friction. For initial identification, teams can rely on a solid foundation such as RED metrics (Rate, Errors, Duration) to isolate the malfunctioning service quickly before digging deeper into the mesh.
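As a rough illustration of the RED starting point described above, the following sketch computes rate, error ratio, and a nearest-rank p95 duration per service from a batch of request records. The `Request` record, the `red_summary` and `worst_service` helpers, and the choice of p95 are all assumptions made for this example; they were not shown by the panel and are not part of any OpenTelemetry API.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are illustrative only."""
    service: str
    duration_ms: float
    is_error: bool


def red_summary(requests, window_seconds):
    """Return {service: {"rate", "error_ratio", "p95_ms"}} for one time window."""
    by_service = defaultdict(list)
    for r in requests:
        by_service[r.service].append(r)

    summary = {}
    for service, items in by_service.items():
        durations = sorted(r.duration_ms for r in items)
        # Nearest-rank p95: smallest sample with at least 95% of samples at or below it.
        rank = max(1, math.ceil(0.95 * len(durations)))
        summary[service] = {
            "rate": len(items) / window_seconds,                          # Rate (requests/s)
            "error_ratio": sum(r.is_error for r in items) / len(items),   # Errors
            "p95_ms": durations[rank - 1],                                # Duration
        }
    return summary


def worst_service(summary):
    """Pick the service with the highest error ratio as the first place to look."""
    return max(summary, key=lambda s: summary[s]["error_ratio"])
```

Ranking by error ratio alone is a deliberate simplification; a real triage view would likely weigh traffic volume and latency as well, and then follow the links from that service's metrics into its traces and logs.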

Balancing the scales: Zero-code vs. manual instrumentation

How do you cleanly generate and process this data? An open ecosystem relies on standardized layers: semantic conventions for unified labels, entry-point APIs, SDK implementations, and open protocols like OTLP to ship data to a backend. But choosing how to instrument your applications requires evaluating trade-offs between automatic and manual approaches:

Zero-code instrumentation

Zero-code (or automatic) instrumentation allows you to configure language-specific SDKs or utilize platform operators to collect telemetry without ever updating your application’s source code. This is ideal for fast initial rollouts or when managing inaccessible third-party software. Advanced options, such as OpenTelemetry eBPF instrumentation (OBI), deliver excellent request, database, and queue visibility while unlocking the ability to correlate network data with application context. However, zero-code options cannot instrument internal business logic. Furthermore, because they hook in automatically, they risk generating massive, unmanageable data volumes if left unconfigured.

Manual instrumentation

Manual instrumentation gives engineers complete control, allowing them to model tracing precision directly around their unique business logic and high-value custom domains. This focus makes it easier to design traces, logs, and metrics together so they tell a coherent story about causality. On the downside, manual instrumentation is time-consuming, introduces long-term maintenance overhead to the codebase, and creates uneven telemetry coverage if development teams lack strict discipline across different programming languages. There is also a distinct risk of over-instrumenting code, which introduces noisy low-value details that slow down active debugging.

Many teams attempt to launch fully manual frameworks from day one, but often stall out and lose executive backing due to slow progress and runaway costs. A practical route is to start with zero-code auto-instrumentation to establish a telemetry baseline quickly, then review the data flowing through your pipelines and refine it by progressively layering in manual instrumentation where deep context is needed.

Day 2: Optimization strategies in the pipeline

A photo image of the three speakers at the 'Telemetry that Matters' panel.

Once telemetry collection is widely deployed, optimization should happen directly within your data pipelines. This allows platform teams to adapt quickly to data explosions without forcing application teams to constantly rewrite and redeploy code.

Several practical reduction techniques can be leveraged within an open data collector pipeline:

  • Smart Sampling: Move away from pure random sampling, which can accidentally drop critical error signals. Implement tail-based or pattern-based sampling to ensure you drop boring, successful requests while capturing 100% of anomalies or failures.
  • Managing High Cardinality: Avoid attaching highly unique attributes like user_id or request_id directly to system metrics, which can instantly trigger a dimensional explosion that breaks backend query engines. Instead, use transform processors to mask unique IDs (e.g., transforming specific URL parameters into a generic $ placeholder), drop unneeded attributes, or truncate fine-grained IPs into broader subnets.
  • Cardinality Limiters: Implement pipeline processors that actively monitor incoming attribute values. If a specific label passes a configured uniqueness threshold, the pipeline automatically skips that attribute to prevent metric performance degradation.
  • Log Deduplication: Use processors that identify identical log lines emitted within a small time window, collapsing them into a single record accompanied by an accurate occurrence count.
  • Infrastructure Enrichment: Minimize individual agent overhead by decoupling per-service metadata collection. Instead, standardize your semantic conventions and inject common infrastructure or container orchestrator labels once centrally within the collection pipeline.
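The reduction techniques listed above can be sketched in plain Python to make their logic concrete. This is only an illustration of the ideas, not collector configuration: in practice these steps would normally run as pipeline processors rather than application code. Every name here (`keep_trace`, `mask_path`, `truncate_ip`, `CardinalityLimiter`, `dedupe_logs`) and every threshold (1000 ms, a /24 subnet, a 10-second window) is a hypothetical choice made for the example.

```python
import ipaddress
import re
from collections import defaultdict


def keep_trace(spans, slow_ms=1000.0):
    """Tail-based decision: taken after the whole trace is seen.
    Keeps traces with any error or slow span; drops the uneventful ones."""
    return any(s["error"] or s["duration_ms"] >= slow_ms for s in spans)


# Matches path segments that are all digits or UUID-shaped (8 hex, dash, 27 more).
_ID_SEGMENT = re.compile(r"/(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F-]{27})(?=/|$)")


def mask_path(path):
    """Replace unique ID segments with a generic $ placeholder."""
    return _ID_SEGMENT.sub("/$", path)


def truncate_ip(ip, prefix=24):
    """Collapse a single IPv4 address into its enclosing subnet."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


class CardinalityLimiter:
    """Drops an attribute key once it has shown more than max_unique values."""

    def __init__(self, max_unique):
        self.max_unique = max_unique
        self._seen = defaultdict(set)
        self._blocked = set()

    def apply(self, attributes):
        kept = {}
        for key, value in attributes.items():
            if key in self._blocked:
                continue
            self._seen[key].add(value)
            if len(self._seen[key]) > self.max_unique:
                self._blocked.add(key)   # threshold crossed: skip from now on
                del self._seen[key]
                continue
            kept[key] = value
        return kept


def dedupe_logs(records, window_s=10.0):
    """Collapse identical messages that arrive within window_s of the first
    occurrence. records: (timestamp_seconds, message) pairs sorted by time."""
    out = []
    open_index = {}  # message -> position in out of its current record
    for ts, msg in records:
        idx = open_index.get(msg)
        if idx is not None and ts - out[idx]["first_ts"] <= window_s:
            out[idx]["count"] += 1
        else:
            open_index[msg] = len(out)
            out.append({"first_ts": ts, "message": msg, "count": 1})
    return out
```

One design point worth noting: the limiter drops the offending attribute rather than the whole data point, so the metric keeps flowing with its remaining, lower-cardinality labels.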

Tracing the probabilistic frontier: Agentic and AI-driven flows

The panel concluded by addressing a major architectural shift: observing agentic and LLM-driven flows.

Traditional microservices operate on deterministic logic: we look for deterministic success criteria, explicit network errors, and reproducible failure states. AI systems break these assumptions. They operate in probabilistic environments where the exact same prompt can yield wildly different results, errors are frequently qualitative rather than technical, and “success” is based on the quality of the response.

Consequently, our definition of telemetry must adapt. While standard latency and error rates still matter, observability must expand to look closely at semantic prompt/response patterns and evaluate decision quality rather than just system uptime. A trace must follow a complex path from the user prompt to the LLM, down to iterative tool and agent calls, onto legacy backend microservices, and back up to a final evaluation loop.

Ultimately, this moves our core question away from “Is the application fast?” and toward “Is our system producing cost-effective, reliable, and correct outcomes?”
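To make that shift concrete, here is a minimal sketch of how a single agent run might be modeled as a span tree and summarized by cost and quality as well as latency. The `Span` class, its `kind` labels, the `cost_usd` and `quality` fields, and the 0.7 quality threshold are all hypothetical; they are not OpenTelemetry semantic conventions and were not specified by the panel.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical span for one step of an agent run."""
    name: str
    kind: str                        # "agent", "llm", "tool", "backend", or "eval"
    duration_ms: float
    cost_usd: float = 0.0            # e.g. token spend attributed to this step
    quality: Optional[float] = None  # evaluator score in [0, 1], set on "eval" spans
    children: List["Span"] = field(default_factory=list)


def walk(span):
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def summarize(root, min_quality=0.7):
    """Roll one trace up into latency, cost, and quality of the outcome."""
    spans = list(walk(root))
    scores = [s.quality for s in spans if s.quality is not None]
    quality = min(scores) if scores else None  # the worst evaluation wins
    return {
        "latency_ms": root.duration_ms,
        "cost_usd": round(sum(s.cost_usd for s in spans), 6),
        "llm_calls": sum(s.kind == "llm" for s in spans),
        "tool_calls": sum(s.kind == "tool" for s in spans),
        "quality": quality,
        # An unevaluated run is treated as not acceptable rather than assumed good.
        "acceptable": quality is not None and quality >= min_quality,
    }
```

With a summary like this, a run that finishes quickly and without technical errors can still be flagged when its evaluation score is low, which is the kind of qualitative failure that latency and error rates alone would miss.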

Key panel takeaways

  1. Correlate Network and Application Data: Incidents don’t cleanly stop at the software layer. Leveraging open tools (like eBPF-driven instrumentation) to seamlessly link core application performance with the actual network transit paths between your user and your cluster is critical for rapid isolation.
  2. Keep an Eye on Emerging Architectural Standards: The community is actively building solutions to alleviate data scaling pain points. Watch for incoming paradigms like retroactive sampling, which allows systems to make a centralized sampling decision first and then pull back the deep, granular trace telemetry on demand.
  3. Optimize Extensibility in the Pipeline: Avoid hardcoding filter rules inside individual services. Rely on scalable collection components to shape, deduplicate, route, and manage your telemetry volume dynamically. Regularly audit your architecture by asking one healthy question: “If this specific data stream stopped flowing tomorrow, what would we actually lose?”