9 Services, One Architecture: What We Learned Shipping FSx for ONTAP Logs to Every Major Observability Platform
Yoshiki Fujiwara(藤原 善基)@AWS Community Builder · 2026-05-31 · via DEV Community

TL;DR

We built and E2E-verified serverless integrations shipping FSx for ONTAP audit logs to 9 observability platforms — all from the same architecture:

For decision makers: roughly 90% lower AWS cost than EC2-based collectors ($66/month → $5-8/month), 9 vendor choices instead of 1, a 30-minute deploy instead of hours, and no servers to patch or scale. Four vendors offer permanent free tiers covering most FSx for ONTAP deployments (New Relic 100 GB/month, Grafana Cloud 50 GB/month, Honeycomb 20M events/month, Sumo Logic 500 MB/day).

                    ┌─────────────────────────────────────────────┐
                    │         One Architecture, 9 Backends        │
                    ├─────────────────────────────────────────────┤
                    │                                             │
                    │  FSx for ONTAP ──→ S3 Access Point          │
                    │       │                                     │
                    │       ▼                                     │
                    │  EventBridge Scheduler (5 min)              │
                    │       │                                     │
                    │       ▼                                     │
                    │  Lambda (vendor-specific handler)           │
                    │       │                                     │
                    │       ├──→ Datadog (Logs API v2)            │
                    │       ├──→ New Relic (Log API v1)           │
                    │       ├──→ Splunk (HEC)                     │
                    │       ├──→ Grafana Cloud (OTLP Gateway)     │
                    │       ├──→ Elastic (Bulk API)               │
                    │       ├──→ Dynatrace (Log Ingest v2)        │
                    │       ├──→ Sumo Logic (HTTP Source)         │
                    │       ├──→ Honeycomb (Events Batch API)     │
                    │       └──→ OTel Collector (OTLP/HTTP)       │
                    │                                             │
                    └─────────────────────────────────────────────┘


12 articles, 9 vendors, 3 event sources (audit logs, EMS webhooks, FPolicy), all CloudFormation-templated, all tested with real FSx for ONTAP data. This post distills what we learned.

This is Part 13 — the series finale — of Serverless Observability for FSx for ONTAP.


The Architecture That Survived 9 Integrations

After implementing 9 vendor integrations, the core pattern remained unchanged:

def lambda_handler(event, context):
    # 1. Get cached credentials (Secrets Manager + TTL, default 5 min)
    creds = auth.get()

    # 2. List new files since checkpoint (S3 AP + SSM)
    new_keys = list_new_keys(s3_ap_arn, prefix, checkpoint)

    # (Simplified: the actual implementation batches events across files
    #  and respects vendor-specific batch size limits)
    for key in new_keys:
        # 3. Read, parse, and format each file
        logs = read_and_parse(key)
        payload = format_for_vendor(logs)  # Only this changes per vendor

        # 4. Ship with retry (vendor API); raises on failure
        ship_to_vendor(payload, creds)

        # 5. Advance checkpoint (only after confirmed delivery)
        update_checkpoint(key)


What changes per vendor: only the formatting and HTTP call (~50-100 lines). Everything else — S3 AP access, checkpoint management, DLQ handling, credential caching, retry logic — is shared.
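
To give a feel for how small the vendor-specific part can be, here is a minimal sketch that formats the same parsed records for two backends. The record fields (`timestamp`, `message`, `user`), the function names, and the `sourcetype`/`logtype` values are hypothetical and not taken from the series' repository. The payload shapes follow the Splunk HEC event format and the New Relic Log API detailed format as publicly documented; verify them against the current vendor docs before relying on this.

```python
import json


def format_for_splunk_hec(records, sourcetype="fsxn:audit"):
    """Splunk HEC accepts several JSON event objects concatenated in one body."""
    return "\n".join(
        json.dumps({"time": r["timestamp"], "sourcetype": sourcetype, "event": r})
        for r in records
    )


def format_for_new_relic(records, logtype="fsxn_audit"):
    """New Relic detailed format: a JSON array of {common, logs} blocks."""
    logs = [
        {
            "timestamp": int(r["timestamp"] * 1000),  # epoch milliseconds
            "message": r["message"],
            "attributes": {
                k: v for k, v in r.items() if k not in ("timestamp", "message")
            },
        }
        for r in records
    ]
    return json.dumps([{"common": {"attributes": {"logtype": logtype}}, "logs": logs}])


# Hypothetical registry: adding a vendor means adding one formatter here.
FORMATTERS = {"splunk": format_for_splunk_hec, "newrelic": format_for_new_relic}


def format_for_vendor(vendor, records):
    return FORMATTERS[vendor](records)
```

Keeping the formatters behind one lookup means the polling, checkpointing, and retry code never needs to know which backend it is feeding.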

Cross-Vendor Comparison: The Numbers

API Characteristics

| Vendor | Endpoint | Auth Model | Max Batch | Success Code | Firehose |
|---|---|---|---|---|---|
| Datadog | Logs API v2 | Header (DD-API-KEY) | 5 MB / 1000 items | 200 | Yes |
| New Relic | Log API v1 | Header (Api-Key) | 1 MB | 202 | Yes |
| Splunk | HEC | Header (Splunk <token>) | No hard limit | 200 | Yes (built-in) |
| Grafana | OTLP Gateway | Basic Auth (base64) | ~4 MB | 200 | No |
| Elastic | Bulk API | Header (ApiKey <b64>) | ~10 MB | 200 | No |
| Dynatrace | Log Ingest v2 | Header (Api-Token) | 1 MB | 204 | Via ActiveGate |
| Sumo Logic | HTTP Source | URL-embedded token | 1 MB | 200 | No |
| Honeycomb | Events Batch | Header (x-honeycomb-team) | 5 MB (impl: 100/batch) | 200 | No |
| OTel Collector | OTLP/HTTP | Configurable | Configurable | 200 | No |

Cost at 10 GB/month

| Vendor | Vendor Cost (per month) | AWS Infra (per month) | Total (per month) | Free Tier |
|---|---|---|---|---|
| Sumo Logic | $0 | ~$5 | ~$5 | 500 MB/day |
| Honeycomb | $0 | ~$5 | ~$5 | 20M events/month |
| New Relic | $0 | ~$5 | ~$5 | 100 GB/month |
| Grafana Cloud | $0 | ~$5 | ~$5 | 50 GB logs/month |
| Datadog | ~$15 | ~$5 | ~$20 | Logs: 14-day trial only |
| Dynatrace | ~$25 | ~$5 | ~$30 | 14-day trial |
| Elastic Cloud | ~$95 | ~$5 | ~$100 | 14-day trial |
| Splunk Cloud | ~$150+ | ~$5 | ~$155+ | N/A |

AWS infrastructure cost is consistent across all vendors (~$5/month for Lambda + EventBridge + Secrets Manager). The vendor platform cost is the differentiator.
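
As a quick sanity check on why 10 GB/month fits the free tiers above, the arithmetic can be sketched as follows. It assumes decimal units (1 GB = 1000^3 bytes), a 30-day month, and evenly spread traffic; vendors may meter differently, and a daily cap such as Sumo Logic's can still be exceeded on a bursty day.

```python
GB = 1000**3  # decimal gigabyte (assumption; vendors may count differently)
MB = 1000**2

monthly_bytes = 10 * GB
days_per_month = 30  # assumed month length

# Average daily volume, to compare with a 500 MB/day allowance.
daily_mb = monthly_bytes / days_per_month / MB
print(f"Average daily volume: {daily_mb:.0f} MB/day")  # Average daily volume: 333 MB/day

# An event-based tier depends on average event size rather than raw volume.
min_avg_event_bytes = monthly_bytes / 20_000_000
print(f"Under 20M events if events average at least {min_avg_event_bytes:.0f} bytes")
```

In other words, 10 GB/month averages about a third below a 500 MB/day cap, and stays under 20M events only if the average audit event is 500 bytes or larger.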

Data Residency

| Vendor | Tokyo (JP) | US | EU | Self-Hosted |
|---|---|---|---|---|
| Sumo Logic | Yes | Yes | Yes | No |
| Elastic | Yes | Yes | Yes | Yes |
| Dynatrace | Yes (region-specific) | Yes | Yes | Yes (Managed) |
| Datadog | No | Yes | Yes | No |
| New Relic | No (July 2026 planned) | Yes | Yes | No |
| Grafana Cloud | Dedicated only | Yes | Yes | No (Alloy self-hosted) |
| Splunk | No | Yes | Yes | Yes |
| Honeycomb | No | Yes | No | No |

Governance note: This table provides technical awareness for vendor selection. Grafana Cloud offers Tokyo region on Dedicated tier (not Free/Pro). Data residency alone does not constitute regulatory compliance. Evaluate your specific requirements (APPI, GDPR, FISC, ISMAP) with your compliance team. See the Retention Policy Matrix for regulation-to-vendor mapping.

Unique Strengths

| Vendor | Best For |
|---|---|
| Datadog | Full-stack APM correlation, broadest feature set |
| New Relic | Generous free tier (100 GB), NRQL power |
| Splunk | Existing Splunk shops, SPL expertise, Firehose native |
| Grafana Cloud | OTLP-native, LogQL, open-source ecosystem |
| Elastic | Data sovereignty (self-hosted), ECS/SIEM, Kibana |
| Dynatrace | Davis AI root cause analysis, APM correlation |
| Sumo Logic | JP region data residency, generous free tier |
| Honeycomb | High-cardinality analysis (BubbleUp, Heatmaps) |
| OTel Collector | Multi-backend, vendor portability, redaction |

Note on Grafana ecosystem: Grafana Alloy (formerly Grafana Agent) provides a Grafana-native alternative to the OpenTelemetry Collector with the same OTLP compatibility. Grafana Cloud's OTLP Gateway is available on all tiers including Free (US/EU regions only). For Tokyo data residency, Grafana Cloud Dedicated is required.

7 Patterns That Survived All 9 Integrations

1. Polling > Event-Driven (for FSx for ONTAP S3 AP)

FSx for ONTAP S3 Access Points don't support S3 Event Notifications. We evaluated CloudTrail data events as an alternative — however, CloudTrail data events for FSx for ONTAP S3 AP access are not consistently available across all configurations. The 5-minute EventBridge Scheduler poll is simpler, cheaper, and sufficient for audit log use cases where near-real-time (not real-time) delivery is acceptable.

2. Checkpoint-After-Delivery

Never advance the checkpoint before confirming vendor delivery. This single rule prevents data loss across all failure modes:

# CORRECT: checkpoint after confirmed delivery
ship_to_vendor(payload)  # Raises on failure
update_checkpoint(key)   # Only reached on success

# WRONG: checkpoint before delivery
update_checkpoint(key)   # What if ship_to_vendor fails next?
ship_to_vendor(payload)  # Data loss if this fails
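
The same rule can be sketched with an SSM-backed checkpoint. This is an illustration under assumptions, not the series' actual code: the class and function names are hypothetical, and the SSM client is injected so the logic can be exercised without AWS. The `get_parameter` and `put_parameter` calls and the `ParameterNotFound` exception are the standard boto3 SSM client API.

```python
class SsmCheckpoint:
    """Stores the last delivered S3 key in an SSM parameter (client is injected)."""

    def __init__(self, ssm_client, name):
        self._ssm = ssm_client
        self._name = name

    def load(self, default=""):
        try:
            return self._ssm.get_parameter(Name=self._name)["Parameter"]["Value"]
        except self._ssm.exceptions.ParameterNotFound:
            return default  # first run: nothing has been delivered yet

    def save(self, key):
        self._ssm.put_parameter(
            Name=self._name, Value=key, Type="String", Overwrite=True
        )


def deliver_then_checkpoint(keys, ship, checkpoint):
    """Ship each key, then record it; a failed ship leaves the checkpoint untouched."""
    for key in keys:
        ship(key)             # raises on failure, so save() below is skipped
        checkpoint.save(key)  # only reached after confirmed delivery
```

If `ship` raises partway through, the stored checkpoint still points at the last delivered key, so the next poll retries from the failed file instead of skipping it.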


3. Credential Caching with Reload-on-401

Every vendor integration uses the same SecretBackedAuth pattern: cache credentials at cold start, reload on TTL expiry or 401/403. This handles credential rotation without Lambda redeployment.
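
A minimal sketch of the idea follows. It is not the repository's `SecretBackedAuth` implementation: the constructor arguments, `invalidate`, and `send_with_reauth` are hypothetical names, the secret is loaded lazily on first use rather than strictly at cold start, and the fetch function is injected so the caching logic can be tested without AWS.

```python
import time


class SecretBackedAuth:
    """Caches a secret for ttl_seconds; fetch_secret is any zero-argument callable."""

    def __init__(self, fetch_secret, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch_secret
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._loaded_at = None

    def get(self):
        expired = (
            self._loaded_at is None
            or self._clock() - self._loaded_at >= self._ttl
        )
        if expired:
            self._value = self._fetch()
            self._loaded_at = self._clock()
        return self._value

    def invalidate(self):
        """Force the next get() to reload, for example after a 401/403."""
        self._loaded_at = None


def send_with_reauth(auth, send):
    """Call send(creds) -> HTTP status; on 401/403 reload the secret and retry once."""
    status = send(auth.get())
    if status in (401, 403):
        auth.invalidate()
        status = send(auth.get())
    return status
```

In a Lambda, `fetch_secret` would typically wrap a Secrets Manager lookup such as `boto3.client("secretsmanager").get_secret_value(SecretId=secret_arn)["SecretString"]`, where `secret_arn` is whatever your template passes in.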

4. Reserved Concurrency = 1

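The audit poller must not run concurrently: two overlapping invocations would read the same checkpoint, then either ship the same files twice or overwrite each other's progress. ReservedConcurrentExecutions: 1 is the simplest guard. For higher throughput, move to DynamoDB-based per-object locking.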

5. DLQ for Every Async Path

Every template includes a KMS-encrypted DLQ. In 9 integrations, the DLQ caught: vendor outages, credential expiry, malformed files, and Lambda timeouts. Without it, these failures would be silent data loss.

6. Vendor-Specific Batch Limits Matter

The biggest implementation difference across vendors is batch size handling:

| Vendor | Limit | Lambda Behavior |
|---|---|---|
| Honeycomb | 100 events | Split into chunks of 100 |
| Dynatrace / Sumo Logic | 1 MB | Measure payload size, split at boundary |
| Datadog | 5 MB / 1000 items | Dual limit check |
| Elastic | ~10 MB | Rarely hit with audit logs |
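
One way to cover all three behaviors in the table with a single helper is sketched below. The function name and parameters are hypothetical, and the size estimate (compact JSON per event plus one separator byte) is an approximation, so leave some headroom below each vendor's documented limit.

```python
import json


def split_batches(events, max_items=None, max_bytes=None):
    """Yield lists of events that respect an item-count and/or size limit.

    Size is estimated as each event's compact JSON encoding plus one separator
    byte. A single event larger than max_bytes is yielded on its own.
    """
    batch, batch_bytes = [], 0
    for event in events:
        size = len(json.dumps(event, separators=(",", ":")).encode("utf-8")) + 1
        too_many = max_items is not None and len(batch) >= max_items
        too_big = (
            max_bytes is not None and batch and batch_bytes + size > max_bytes
        )
        if too_many or too_big:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(event)
        batch_bytes += size
    if batch:
        yield batch
```

For example, `split_batches(events, max_items=100)` gives count-based chunks, `split_batches(events, max_bytes=1_000_000)` gives size-based chunks, and passing both arguments applies a dual limit check.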

7. OTLP as the Universal Format

If you're unsure which vendor you'll use long-term, start with OTLP. The OTel Collector integration (Part 5) proved that a single Lambda producing OTLP can feed Datadog, Grafana, and Honeycomb simultaneously — with zero code changes when adding or removing backends.

Beyond multi-backend delivery, the OTel Collector provides:

  • Enrichment: Resource detection, Kubernetes attributes, custom metadata injection
  • Sampling: Tail-based sampling for high-volume environments
  • Redaction: PII field removal/masking before data leaves your account (see PII Redaction Cookbook)
  • Format conversion: OTLP ↔ vendor-native format translation

Verified version: All OTel Collector configurations in this series were tested with OpenTelemetry Collector Contrib v0.152.0. OTel Collector has frequent releases with potential breaking changes — pin your version in production and test before upgrading.
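
For readers who have not seen OTLP on the wire, this is roughly what a Lambda producing OTLP/HTTP JSON logs would build. It is a sketch under assumptions: the record fields, the `service.name` value, and the scope name are hypothetical, and all extra fields are stringified for simplicity. The envelope (`resourceLogs` → `scopeLogs` → `logRecords`) follows the OTLP JSON encoding.

```python
import json


def to_otlp_logs(records, service_name="fsxn-audit"):
    """Wrap parsed records in the OTLP/HTTP JSON envelope for the logs signal."""
    log_records = []
    for r in records:
        log_records.append({
            # OTLP JSON encodes 64-bit integers as strings.
            "timeUnixNano": str(int(r["timestamp"] * 1_000_000_000)),
            "severityText": "INFO",
            "body": {"stringValue": r["message"]},
            "attributes": [
                {"key": k, "value": {"stringValue": str(v)}}
                for k, v in r.items()
                if k not in ("timestamp", "message")
            ],
        })
    return {
        "resourceLogs": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}}
            ]},
            "scopeLogs": [{
                "scope": {"name": "fsxn.audit.poller"},
                "logRecords": log_records,
            }],
        }]
    }


payload = to_otlp_logs(
    [{"timestamp": 1700000000, "message": "file opened", "user": "alice"}]
)
body = json.dumps(payload)  # POST to <collector>/v1/logs as application/json
```

Because the Lambda only ever emits this one shape, adding or removing a backend becomes a Collector configuration change rather than a code change.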

What We'd Do Differently

Start with OTel Collector for Multi-Vendor Evaluation

If evaluating multiple vendors, deploy the OTel Collector path first. It lets you send the same data to 2-3 vendors simultaneously for comparison, without deploying separate Lambda stacks per vendor.

Define SLOs Before Building

We defined Pipeline SLOs after building all 9 integrations. In hindsight, defining "< 10 min delivery latency" and "< 0.01% data loss" upfront would have guided design decisions earlier (e.g., checkpoint granularity, retry policy).

Data Classification First

Audit logs contain PII (usernames, file paths). We documented this in the Data Classification Guide after implementation. For regulated environments, classify fields before choosing a vendor — it may eliminate options that don't support your data residency requirements.

Production Readiness Framework

After 9 integrations, we formalized a 4-level production readiness model:

| Level | What | Go/No-Go to Next |
|---|---|---|
| Level 1: Quickstart | Audit poller + DLQ | Logs arrive, checkpoint advances, DLQ empty 24h |
| Level 2: Operational PoC | + Dashboard + alerts | SLOs met 7 days, security review done |
| Level 3: Production | + DynamoDB ledger + poison-pill | SLOs met 30 days, compliance pack |
| Level 4: Enterprise | + OTel Collector + redaction | Multi-backend, PII redaction, DR tested |

Most PoCs should target Level 2. Production deployments need Level 3. Enterprise pipelines with compliance requirements need Level 4.

Recommended transition timeline:

  • Level 1 → Level 2: ~1 week (add dashboards, define SLOs, validate 7-day stability)
  • Level 2 → Level 3: ~2-4 weeks (deploy DynamoDB ledger, implement poison-pill handling, complete security review)
  • Level 3 → Level 4: ~1-2 months (deploy OTel Collector, implement PII redaction, test DR failover, complete compliance evidence pack)

Full criteria: Pipeline SLO Definitions

Vendor Selection Decision Tree

Start
  |
  +-- Need JP data residency?
  |   +-- Yes -> Sumo Logic (JP) or Elastic (self-hosted in Tokyo VPC)
  |   +-- No  |
  |           v
  +-- Need self-hosted (air-gapped)?
  |   +-- Yes -> Elastic or Splunk
  |   +-- No  |
  |           v
  +-- Already have an observability platform?
  |   +-- Yes -> Use that vendor (all 9 are supported)
  |   +-- No  |
  |           v
  +-- Budget constraint (free tier needed)?
  |   +-- Yes -> Sumo Logic (500 MB/day) or Honeycomb (20M events) or New Relic (100 GB)
  |   +-- No  |
  |           v
  +-- Need AI-powered root cause analysis?
  |   +-- Yes -> Dynatrace (Davis AI)
  |   +-- No  |
  |           v
  +-- Need high-cardinality analysis?
  |   +-- Yes -> Honeycomb (BubbleUp)
  |   +-- No  |
  |           v
  +-- Need multi-backend / vendor portability?
  |   +-- Yes -> OTel Collector
  |   +-- No  |
  |           v
  +-- Default -> Datadog (broadest) or Grafana (OTLP-native, open ecosystem)
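
If you want to embed the same logic in a sizing script or internal tool, the tree can be expressed as a first-match-wins function. This is only a restatement of the tree above; the function and parameter names are hypothetical.

```python
def choose_vendor(*, jp_residency=False, self_hosted=False, existing_platform=None,
                  free_tier=False, ai_root_cause=False, high_cardinality=False,
                  multi_backend=False):
    """Walk the questions in the same order as the tree; the first match wins."""
    if jp_residency:
        return ["Sumo Logic (JP)", "Elastic (self-hosted in Tokyo VPC)"]
    if self_hosted:
        return ["Elastic", "Splunk"]
    if existing_platform:
        return [existing_platform]  # all 9 backends are supported
    if free_tier:
        return ["Sumo Logic", "Honeycomb", "New Relic"]
    if ai_root_cause:
        return ["Dynatrace"]
    if high_cardinality:
        return ["Honeycomb"]
    if multi_backend:
        return ["OTel Collector"]
    return ["Datadog", "Grafana"]
```

Note that question order matters: a team that needs both JP residency and a free tier is routed by the residency question first.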


The FSx for ONTAP S3 AP Constraint That Shaped Everything

The single most impactful technical constraint: FSx for ONTAP S3 Access Points do not support S3 Event Notifications.

This one fact drove:

  • EventBridge Scheduler polling pattern (not event-driven)
  • SSM Parameter Store checkpointing (track what's been processed)
  • Reserved concurrency = 1 (prevent checkpoint races)
  • Safety threshold (stop before Lambda timeout)
  • MAX_KEYS_PER_RUN (bound processing per invocation)
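
The last two items can be sketched together as a bounded processing loop. The default values and function names here are hypothetical, and the sketch assumes keys sort lexicographically in the order they should be processed. `context.get_remaining_time_in_millis()` is the standard method on the Lambda Python context object.

```python
MAX_KEYS_PER_RUN = 500        # hypothetical default
SAFETY_THRESHOLD_MS = 30_000  # hypothetical: stop with 30 s of Lambda time left


def process_bounded(keys, context, process_key, save_checkpoint,
                    max_keys=MAX_KEYS_PER_RUN, safety_ms=SAFETY_THRESHOLD_MS):
    """Process keys in order, stopping early at the key cap or the time threshold.

    Returns the number of keys processed; the rest wait for the next poll.
    """
    processed = 0
    for key in sorted(keys)[:max_keys]:
        if context.get_remaining_time_in_millis() < safety_ms:
            break             # leave the remainder for the next scheduled run
        process_key(key)      # read, format, ship; raises on failure
        save_checkpoint(key)  # only reached after confirmed delivery
        processed += 1
    return processed
```

Stopping voluntarily before the timeout means an invocation always ends with a consistent checkpoint, rather than being killed between shipping a file and recording it.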

If FSx for ONTAP S3 APs add event notification support in the future, the architecture could simplify significantly. As of May 2026, this feature is not supported, and the polling pattern is battle-tested across 9 vendors.

Cost Reality: EC2 vs Serverless

The original motivation: replace the EC2-based Splunk pattern (2x EC2 instances) with serverless.

| Metric | EC2 Pattern | Serverless Pattern |
|---|---|---|
| Monthly AWS cost | ~$66 | ~$5-8 |
| OS patching | Required | None |
| Scaling | Manual | Automatic |
| Vendor support | Splunk only | 9 vendors |
| Deploy time | Hours | 30 minutes |
| Recovery from failure | Manual restart | Automatic (DLQ + retry) |

That is roughly a 90% cost reduction (about 88-92%, from ~$66 to ~$5-8 per month) with no servers to patch or scale. The serverless pattern wins on every dimension except one: real-time latency (EC2 syslog can be sub-second; our poller runs at 5-minute intervals). For audit logs, 5 minutes is acceptable. For real-time needs, use the FPolicy path (< 30 seconds).
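
The headline percentage follows directly from the monthly AWS costs in the table; a short worked calculation using only those figures:

```python
ec2_monthly = 66.0
serverless_low, serverless_high = 5.0, 8.0

best = (ec2_monthly - serverless_low) / ec2_monthly * 100
worst = (ec2_monthly - serverless_high) / ec2_monthly * 100
print(f"Reduction: {worst:.0f}% to {best:.0f}%")  # Reduction: 88% to 92%

annual_low = (ec2_monthly - serverless_high) * 12
annual_high = (ec2_monthly - serverless_low) * 12
print(f"Annual saving: ${annual_low:.0f} to ${annual_high:.0f}")  # Annual saving: $696 to $732
```

These numbers cover AWS infrastructure only; vendor platform charges are separate.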

What's Next

This series covered the foundation. The project continues; see the full ROADMAP.

Thank you for following this series. If you've deployed any of these integrations, I'd love to hear about your experience — drop a comment or open a GitHub issue.

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations