TL;DR
We built and E2E-verified serverless integrations that ship FSx for ONTAP audit logs to 9 observability platforms, all from the same architecture (diagram below).
For decision makers: roughly 90% lower AWS cost than EC2-based collectors (~$66/month → ~$5-8/month), 9 vendor choices instead of 1, a 30-minute deploy instead of hours, and no OS patching or manual scaling. Four vendors offer permanent free tiers that cover most FSx for ONTAP deployments (New Relic 100 GB/month, Grafana Cloud 50 GB/month, Honeycomb 20M events/month, Sumo Logic 500 MB/day).
┌─────────────────────────────────────────────────────────┐
│              One Architecture, 9 Backends               │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  FSx for ONTAP ──→ S3 Access Point                      │
│                    │                                    │
│                    ▼                                    │
│      EventBridge Scheduler (5 min)                      │
│                    │                                    │
│                    ▼                                    │
│    Lambda (vendor-specific handler)                     │
│                    │                                    │
│                    ├──→ Datadog (Logs API v2)           │
│                    ├──→ New Relic (Log API v1)          │
│                    ├──→ Splunk (HEC)                    │
│                    ├──→ Grafana Cloud (OTLP Gateway)    │
│                    ├──→ Elastic (Bulk API)              │
│                    ├──→ Dynatrace (Log Ingest v2)       │
│                    ├──→ Sumo Logic (HTTP Source)        │
│                    ├──→ Honeycomb (Events Batch API)    │
│                    └──→ OTel Collector (OTLP/HTTP)      │
│                                                         │
└─────────────────────────────────────────────────────────┘
12 articles, 9 vendors, 3 event sources (audit logs, EMS webhooks, FPolicy), all CloudFormation-templated, all tested with real FSx for ONTAP data. This post distills what we learned.
This is Part 13 — the series finale — of Serverless Observability for FSx for ONTAP.
The Architecture That Survived 9 Integrations
After implementing 9 vendor integrations, the core pattern remained unchanged:
def lambda_handler(event, context):
    # 1. Get cached credentials (Secrets Manager + TTL, default 5 min)
    creds = auth.get()
    # 2. List new files since checkpoint (S3 AP + SSM)
    new_keys = list_new_keys(s3_ap_arn, prefix, checkpoint)
    # 3. Read, parse, format, ship per file (vendor-specific)
    # (Simplified: the actual implementation batches events across files
    # and respects vendor-specific batch size limits)
    for key in new_keys:
        logs = read_and_parse(key)
        payload = format_for_vendor(logs)  # Only this changes per vendor
        # 4. Ship with retry (vendor API)
        ship_to_vendor(payload, creds)
        # 5. Advance checkpoint (only after confirmed delivery)
        update_checkpoint(key)
What changes per vendor: only the formatting and HTTP call (~50-100 lines). Everything else — S3 AP access, checkpoint management, DLQ handling, credential caching, retry logic — is shared.
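To make the "only the formatting changes" point concrete, here is a minimal sketch of two interchangeable formatters. The event fields, the `ddsource`/`service`/`sourcetype` values, and the `FORMATTERS` registry are illustrative assumptions, not the series' actual code; the payload shapes follow the general form of the Datadog Logs API v2 (a JSON array of log objects) and Splunk HEC (stacked JSON event objects).

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical parsed audit event; field names are illustrative only.
Event = Dict[str, Any]


def format_for_datadog(events: List[Event]) -> bytes:
    """Build a JSON array of log objects, the shape Datadog Logs API v2 accepts."""
    body = [
        {
            "ddsource": "fsx-ontap",   # assumed source tag
            "service": "fsxn-audit",   # assumed service name
            "message": json.dumps(e, sort_keys=True),
        }
        for e in events
    ]
    return json.dumps(body).encode("utf-8")


def format_for_splunk_hec(events: List[Event]) -> bytes:
    """Build newline-separated JSON event objects, the shape Splunk HEC accepts."""
    lines = [
        json.dumps({"sourcetype": "fsxn:audit", "event": e}, sort_keys=True)
        for e in events
    ]
    return "\n".join(lines).encode("utf-8")


# The shared handler would look up the formatter by vendor name.
FORMATTERS: Dict[str, Callable[[List[Event]], bytes]] = {
    "datadog": format_for_datadog,
    "splunk": format_for_splunk_hec,
}
```

Under this sketch, adding a vendor means adding one function and one registry entry; the polling, checkpoint, and retry code never sees the payload format.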
Cross-Vendor Comparison: The Numbers
API Characteristics
| Vendor | Endpoint | Auth Model | Max Batch | Success Code | Firehose |
|---|---|---|---|---|---|
| Datadog | Logs API v2 | Header (DD-API-KEY) | 5 MB / 1000 items | 200 | Yes |
| New Relic | Log API v1 | Header (Api-Key) | 1 MB | 202 | Yes |
| Splunk | HEC | Header (Splunk <token>) | No hard limit | 200 | Yes (built-in) |
| Grafana | OTLP Gateway | Basic Auth (base64) | ~4 MB | 200 | No |
| Elastic | Bulk API | Header (ApiKey <b64>) | ~10 MB | 200 | No |
| Dynatrace | Log Ingest v2 | Header (Api-Token) | 1 MB | 204 | Via ActiveGate |
| Sumo Logic | HTTP Source | URL-embedded token | 1 MB | 200 | No |
| Honeycomb | Events Batch | Header (x-honeycomb-team) | 5 MB (impl: 100/batch) | 200 | No |
| OTel Collector | OTLP/HTTP | Configurable | Configurable | 200 | No |
Cost at 10 GB/month
| Vendor | Vendor Cost | AWS Infra | Total | Free Tier |
|---|---|---|---|---|
| Sumo Logic | $0 | ~$5 | ~$5 | 500 MB/day |
| Honeycomb | $0 | ~$5 | ~$5 | 20M events/month |
| New Relic | $0 | ~$5 | ~$5 | 100 GB/month |
| Grafana Cloud | $0 | ~$5 | ~$5 | 50 GB logs/month |
| Datadog | ~$15 | ~$5 | ~$20 | Logs: 14-day trial only |
| Dynatrace | ~$25 | ~$5 | ~$30 | 14-day trial |
| Elastic Cloud | ~$95 | ~$5 | ~$100 | 14-day trial |
| Splunk Cloud | ~$150+ | ~$5 | ~$155+ | N/A |
AWS infrastructure cost is consistent across all vendors (~$5/month for Lambda + EventBridge + Secrets Manager). The vendor platform cost is the differentiator.
Data Residency
| Vendor | Tokyo (JP) | US | EU | Self-Hosted |
|---|---|---|---|---|
| Sumo Logic | Yes | Yes | Yes | No |
| Elastic | Yes | Yes | Yes | Yes |
| Dynatrace | Yes (region-specific) | Yes | Yes | Yes (Managed) |
| Datadog | No | Yes | Yes | No |
| New Relic | No (July 2026 planned) | Yes | Yes | No |
| Grafana Cloud | Dedicated only | Yes | Yes | No (Alloy self-hosted) |
| Splunk | No | Yes | Yes | Yes |
| Honeycomb | No | Yes | No | No |
Governance note: This table provides technical awareness for vendor selection. Grafana Cloud offers Tokyo region on Dedicated tier (not Free/Pro). Data residency alone does not constitute regulatory compliance. Evaluate your specific requirements (APPI, GDPR, FISC, ISMAP) with your compliance team. See the Retention Policy Matrix for regulation-to-vendor mapping.
Unique Strengths
| Vendor | Best For |
|---|---|
| Datadog | Full-stack APM correlation, broadest feature set |
| New Relic | Generous free tier (100 GB), NRQL power |
| Splunk | Existing Splunk shops, SPL expertise, Firehose native |
| Grafana Cloud | OTLP-native, LogQL, open-source ecosystem |
| Elastic | Data sovereignty (self-hosted), ECS/SIEM, Kibana |
| Dynatrace | Davis AI root cause analysis, APM correlation |
| Sumo Logic | JP region data residency, generous free tier |
| Honeycomb | High-cardinality analysis (BubbleUp, Heatmaps) |
| OTel Collector | Multi-backend, vendor portability, redaction |
Note on Grafana ecosystem: Grafana Alloy (formerly Grafana Agent) provides a Grafana-native alternative to the OpenTelemetry Collector with the same OTLP compatibility. Grafana Cloud's OTLP Gateway is available on all tiers including Free (US/EU regions only). For Tokyo data residency, Grafana Cloud Dedicated is required.
7 Patterns That Survived All 9 Integrations
1. Polling > Event-Driven (for FSx for ONTAP S3 AP)
FSx for ONTAP S3 Access Points don't support S3 Event Notifications. We evaluated CloudTrail data events as an alternative — however, CloudTrail data events for FSx for ONTAP S3 AP access are not consistently available across all configurations. The 5-minute EventBridge Scheduler poll is simpler, cheaper, and sufficient for audit log use cases where near-real-time (not real-time) delivery is acceptable.
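Each poll then reduces to "which keys are newer than the checkpoint?". A minimal sketch of that selection step follows, assuming object keys embed a sortable timestamp so that lexicographic order matches arrival order. That assumption, the key names in the usage example, and the `select_new_keys` helper are hypothetical and not taken from the series code.

```python
from typing import Iterable, List, Optional


def select_new_keys(
    listed_keys: Iterable[str],
    checkpoint: Optional[str],
    max_keys: int = 100,
) -> List[str]:
    """Return up to max_keys keys that sort strictly after the checkpoint.

    A checkpoint of None means nothing has been processed yet.
    """
    candidates = sorted(
        k for k in listed_keys if checkpoint is None or k > checkpoint
    )
    return candidates[:max_keys]
```

For example, with keys `audit/2026-05-01T00-00.json`, `...T00-05.json`, and `...T00-10.json` and a checkpoint at the first one, the function returns the latter two in order. If keys do not sort chronologically, a per-object ledger is the safer design.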
2. Checkpoint-After-Delivery
Never advance the checkpoint before confirming vendor delivery. This single rule prevents data loss across all failure modes:
# CORRECT: checkpoint after confirmed delivery
ship_to_vendor(payload) # Raises on failure
update_checkpoint(key) # Only reached on success
# WRONG: checkpoint before delivery
update_checkpoint(key) # What if ship_to_vendor fails next?
ship_to_vendor(payload) # Data loss if this fails
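The ordering rule can be exercised without any AWS dependency. The sketch below injects the ship and checkpoint functions so a failure can be simulated; `DeliveryError`, `process_keys`, and the choice to stop at the first failure (rather than let the exception fail the whole invocation) are assumptions made for illustration.

```python
from typing import Callable, Iterable, List


class DeliveryError(Exception):
    """Raised when the vendor does not confirm receipt."""


def process_keys(
    keys: Iterable[str],
    ship: Callable[[str], None],
    save_checkpoint: Callable[[str], None],
) -> List[str]:
    """Ship keys in order and checkpoint each one only after it is delivered.

    Stops at the first failed delivery so the checkpoint never moves past
    an undelivered key. Returns the keys that were delivered.
    """
    delivered: List[str] = []
    for key in keys:
        try:
            ship(key)              # must raise on any non-success response
        except DeliveryError:
            break                  # checkpoint stays at the last delivered key
        save_checkpoint(key)       # only reached after confirmed delivery
        delivered.append(key)
    return delivered
```

If shipping key `b` fails in a run over `a`, `b`, `c`, the stored checkpoint is `a`, and the next scheduled run retries from `b`. The cost is possible duplicate delivery after a crash between ship and checkpoint, which is usually the better trade for audit data.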
3. Credential Caching with Reload-on-401
Every vendor integration uses the same SecretBackedAuth pattern: cache credentials at cold start, reload on TTL expiry or 401/403. This handles credential rotation without Lambda redeployment.
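A rough sketch of the caching idea is shown below. It is not the series' `SecretBackedAuth` class: `CachedSecret`, `send_with_reload`, the injected `fetch` callable (standing in for a Secrets Manager lookup), and the single-retry policy are all assumptions chosen to keep the example self-contained.

```python
import time
from typing import Callable, Optional


class CachedSecret:
    """Cache a secret for ttl_seconds; reload on expiry or when forced."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch        # e.g. a Secrets Manager lookup (injected)
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._loaded_at = 0.0

    def get(self, force: bool = False) -> str:
        expired = self._clock() - self._loaded_at >= self._ttl
        if force or self._value is None or expired:
            self._value = self._fetch()
            self._loaded_at = self._clock()
        return self._value


def send_with_reload(auth: CachedSecret, send: Callable[[str], int]) -> int:
    """Call send(token); on 401/403, reload the secret once and retry."""
    status = send(auth.get())
    if status in (401, 403):
        status = send(auth.get(force=True))
    return status
```

Retrying only once keeps a genuinely revoked key from looping; the second 401/403 is returned to the caller, which can fail the invocation and let the DLQ capture it.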
4. Reserved Concurrency = 1
The audit poller must not run concurrently: two overlapping invocations would read the same checkpoint and race to update it. ReservedConcurrentExecutions: 1 is the simplest guard. For higher throughput, move to DynamoDB-based per-object locking.
5. DLQ for Every Async Path
Every template includes a KMS-encrypted DLQ. In 9 integrations, the DLQ caught: vendor outages, credential expiry, malformed files, and Lambda timeouts. Without it, these failures would be silent data loss.
6. Vendor-Specific Batch Limits Matter
The biggest implementation difference across vendors is batch size handling:
| Vendor | Limit | Lambda Behavior |
|---|---|---|
| Honeycomb | 100 events (implementation cap; API limit is 5 MB) | Split into chunks of 100 |
| Dynatrace / Sumo Logic | 1 MB | Measure payload size, split at boundary |
| Datadog | 5 MB / 1000 items | Dual limit check |
| Elastic | ~10 MB | Rarely hit with audit logs |
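The dual limit check in the table can be sketched as a generator that closes a batch when either cap would be exceeded. The size estimate (JSON length of each event plus one separator byte) is an assumption; a real payload adds envelope overhead, so production limits should leave headroom. `split_batches` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def split_batches(
    events: Iterable[Dict[str, Any]],
    max_items: int = 1000,
    max_bytes: int = 5 * 1024 * 1024,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield batches that respect both an item cap and a byte cap."""
    batch: List[Dict[str, Any]] = []
    batch_bytes = 0
    for event in events:
        # Approximate wire size: encoded event plus one separator byte.
        size = len(json.dumps(event).encode("utf-8")) + 1
        if size > max_bytes:
            raise ValueError("single event exceeds max_bytes")
        # Close the current batch if adding this event would break a cap.
        if batch and (len(batch) >= max_items or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(event)
        batch_bytes += size
    if batch:
        yield batch
```

The defaults mirror the Datadog row; calling it with `max_items=100` gives the Honeycomb chunking, and `max_bytes=1024 * 1024` with a large `max_items` gives the size-only split used for 1 MB vendors.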
7. OTLP as the Universal Format
If you're unsure which vendor you'll use long-term, start with OTLP. The OTel Collector integration (Part 5) proved that a single Lambda producing OTLP can feed Datadog, Grafana, and Honeycomb simultaneously — with zero code changes when adding or removing backends.
Beyond multi-backend delivery, the OTel Collector provides:
- Enrichment: Resource detection, Kubernetes attributes, custom metadata injection
- Sampling: Tail-based sampling for high-volume environments
- Redaction: PII field removal/masking before data leaves your account (see PII Redaction Cookbook)
- Format conversion: OTLP ↔ vendor-native format translation
Verified version: All OTel Collector configurations in this series were tested with OpenTelemetry Collector Contrib v0.152.0. OTel Collector has frequent releases with potential breaking changes — pin your version in production and test before upgrading.
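For readers who have not seen OTLP on the wire, the sketch below wraps parsed events in the OTLP/HTTP JSON logs structure (`resourceLogs` → `scopeLogs` → `logRecords`), which a Collector's OTLP receiver accepts at `/v1/logs` with `Content-Type: application/json`. The service name, scope name, flat string attributes, and the use of the current time instead of the event's own timestamp are simplifying assumptions.

```python
import json
import time
from typing import Any, Dict, List


def _attr(key: str, value: Any) -> Dict[str, Any]:
    """Encode one attribute as an OTLP/JSON key-value pair (string-typed)."""
    return {"key": key, "value": {"stringValue": str(value)}}


def to_otlp_logs(
    events: List[Dict[str, Any]],
    service_name: str = "fsxn-audit",   # assumed service name
) -> Dict[str, Any]:
    """Wrap parsed events in an OTLP/HTTP JSON logs request body."""
    # OTLP/JSON encodes 64-bit integers as decimal strings.
    # Simplification: stamps every record with "now", not the event time.
    now_ns = str(time.time_ns())
    records = [
        {
            "timeUnixNano": now_ns,
            "severityText": "INFO",
            "body": {"stringValue": json.dumps(e, sort_keys=True)},
            "attributes": [_attr(k, v) for k, v in sorted(e.items())],
        }
        for e in events
    ]
    return {
        "resourceLogs": [
            {
                "resource": {"attributes": [_attr("service.name", service_name)]},
                "scopeLogs": [
                    {"scope": {"name": "fsxn-audit-poller"}, "logRecords": records}
                ],
            }
        ]
    }
```

Because every backend in the Collector pipeline receives this same structure, routing and redaction become Collector configuration rather than Lambda code.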
What We'd Do Differently
Start with OTel Collector for Multi-Vendor Evaluation
If evaluating multiple vendors, deploy the OTel Collector path first. It lets you send the same data to 2-3 vendors simultaneously for comparison, without deploying separate Lambda stacks per vendor.
Define SLOs Before Building
We defined Pipeline SLOs after building all 9 integrations. In hindsight, defining "< 10 min delivery latency" and "< 0.01% data loss" upfront would have guided design decisions earlier (e.g., checkpoint granularity, retry policy).
Data Classification First
Audit logs contain PII (usernames, file paths). We documented this in the Data Classification Guide after implementation. For regulated environments, classify fields before choosing a vendor — it may eliminate options that don't support your data residency requirements.
Production Readiness Framework
After 9 integrations, we formalized a 4-level production readiness model:
| Level | What | Go/No-Go to Next |
|---|---|---|
| Level 1: Quickstart | Audit poller + DLQ | Logs arrive, checkpoint advances, DLQ empty 24h |
| Level 2: Operational PoC | + Dashboard + alerts | SLOs met 7 days, security review done |
| Level 3: Production | + DynamoDB ledger + poison-pill | SLOs met 30 days, compliance pack |
| Level 4: Enterprise | + OTel Collector + redaction | Multi-backend, PII redaction, DR tested |
Most PoCs should target Level 2. Production deployments need Level 3. Enterprise pipelines with compliance requirements need Level 4.
Recommended transition timeline:
- Level 1 → Level 2: ~1 week (add dashboards, define SLOs, validate 7-day stability)
- Level 2 → Level 3: ~2-4 weeks (deploy DynamoDB ledger, implement poison-pill handling, complete security review)
- Level 3 → Level 4: ~1-2 months (deploy OTel Collector, implement PII redaction, test DR failover, complete compliance evidence pack)
Full criteria: Pipeline SLO Definitions
Vendor Selection Decision Tree
Start
|
+-- Need JP data residency?
| +-- Yes -> Sumo Logic (JP) or Elastic (self-hosted in Tokyo VPC)
| +-- No |
| v
+-- Need self-hosted (air-gapped)?
| +-- Yes -> Elastic or Splunk
| +-- No |
| v
+-- Already have an observability platform?
| +-- Yes -> Use that vendor (all 9 are supported)
| +-- No |
| v
+-- Budget constraint (free tier needed)?
| +-- Yes -> Sumo Logic (500 MB/day) or Honeycomb (20M events) or New Relic (100 GB)
| +-- No |
| v
+-- Need AI-powered root cause analysis?
| +-- Yes -> Dynatrace (Davis AI)
| +-- No |
| v
+-- Need high-cardinality analysis?
| +-- Yes -> Honeycomb (BubbleUp)
| +-- No |
| v
+-- Need multi-backend / vendor portability?
| +-- Yes -> OTel Collector
| +-- No |
| v
+-- Default -> Datadog (broadest) or Grafana (OTLP-native, open ecosystem)
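The same tree can be written as a first-match function, which makes the precedence of the questions explicit (data residency outranks budget, for example). The `Needs` dataclass and `recommend` function are hypothetical names; the branches and recommendations are copied from the tree above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Needs:
    """Answers to the decision-tree questions, in the order they are asked."""
    jp_residency: bool = False
    self_hosted: bool = False
    existing_platform: Optional[str] = None
    free_tier: bool = False
    ai_root_cause: bool = False
    high_cardinality: bool = False
    multi_backend: bool = False


def recommend(n: Needs) -> List[str]:
    """Return the vendor options for the first question answered 'yes'."""
    if n.jp_residency:
        return ["Sumo Logic (JP)", "Elastic (self-hosted in Tokyo VPC)"]
    if n.self_hosted:
        return ["Elastic", "Splunk"]
    if n.existing_platform:
        return [n.existing_platform]
    if n.free_tier:
        return ["Sumo Logic", "Honeycomb", "New Relic"]
    if n.ai_root_cause:
        return ["Dynatrace"]
    if n.high_cardinality:
        return ["Honeycomb"]
    if n.multi_backend:
        return ["OTel Collector"]
    return ["Datadog", "Grafana"]
```

For instance, `recommend(Needs(jp_residency=True, free_tier=True))` returns the Tokyo-capable options, because the residency question is evaluated before the budget question.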
The FSx for ONTAP S3 AP Constraint That Shaped Everything
The single most impactful technical constraint: FSx for ONTAP S3 Access Points do not support S3 Event Notifications.
This one fact drove:
- EventBridge Scheduler polling pattern (not event-driven)
- SSM Parameter Store checkpointing (track what's been processed)
- Reserved concurrency = 1 (prevent checkpoint races)
- Safety threshold (stop before Lambda timeout)
- MAX_KEYS_PER_RUN (bound processing per invocation)
If FSx for ONTAP S3 APs add event notification support in the future, the architecture could simplify significantly. As of May 2026, this feature is not supported, and the polling pattern is battle-tested across 9 vendors.
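The last two items in that list (safety threshold and MAX_KEYS_PER_RUN) combine into one bounded loop. In the sketch below, `remaining_ms` is any zero-argument callable; inside Lambda it would be `context.get_remaining_time_in_millis`. The function name `run_bounded` and the default values of 50 keys and 30 seconds are illustrative assumptions, not the series' settings.

```python
from typing import Callable, List, Sequence


def run_bounded(
    keys: Sequence[str],
    process: Callable[[str], None],
    remaining_ms: Callable[[], int],
    max_keys: int = 50,          # stands in for MAX_KEYS_PER_RUN
    safety_ms: int = 30_000,     # stop this long before the Lambda timeout
) -> List[str]:
    """Process keys until the key cap or the time safety threshold is hit."""
    done: List[str] = []
    for key in keys[:max_keys]:
        if remaining_ms() <= safety_ms:
            break                # leave the rest for the next scheduled run
        process(key)
        done.append(key)
    return done
```

Anything left unprocessed is picked up on the next 5-minute poll, since the checkpoint only advances for keys that actually completed.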
Cost Reality: EC2 vs Serverless
The original motivation: replace the EC2-based Splunk pattern (2x EC2 instances) with serverless.
| Metric | EC2 Pattern | Serverless Pattern |
|---|---|---|
| Monthly AWS cost | ~$66 | ~$5-8 |
| OS patching | Required | None |
| Scaling | Manual | Automatic |
| Vendor support | Splunk only | 9 vendors |
| Deploy time | Hours | 30 minutes |
| Recovery from failure | Manual restart | Automatic (DLQ + retry) |
90% cost reduction with zero operational burden. The serverless pattern wins on every dimension except one: real-time latency (EC2 syslog can be sub-second; our poller is 5-minute intervals). For audit logs, 5 minutes is acceptable. For real-time needs, use the FPolicy path (< 30 seconds).
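As a quick check on the headline figure, the reduction can be computed from the two ends of the serverless cost range in the table. The helper name is hypothetical; the dollar figures are the ones quoted above.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction from before to after, rounded to one decimal."""
    return round((before - after) / before * 100, 1)


ec2_monthly = 66.0
low_end = reduction_pct(ec2_monthly, 8.0)    # serverless at ~$8/month
high_end = reduction_pct(ec2_monthly, 5.0)   # serverless at ~$5/month
print(f"{low_end}% to {high_end}%")          # 87.9% to 92.4%
```

So "90%" is the midpoint of an 88-92% range, and it covers AWS infrastructure only; vendor platform cost is separate.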
What's Next
This series covered the foundation. The project continues with:
- Phase 3 (delivered): Multi-account deployment (AWS Organizations + StackSets)
- Phase 3 (delivered): DynamoDB object ledger for per-object processing state
- Phase 3 (delivered): SQS buffering pattern for backpressure handling
- Phase 3 (delivered): Cross-region DR with Active-Passive failover
- Phase 3 (delivered): OTel Collector PII redaction cookbook (7 recipes for APPI/GDPR)
- Phase 4: Terraform module equivalents
- Phase 4: CDK construct library
See the full ROADMAP.
Resources
- GitHub: github.com/Yoshiki0705/fsxn-observability-integrations
- Pipeline SLO: docs/en/pipeline-slo.md
- Data Classification: docs/en/data-classification.md
- S3 AP Throughput Benchmark: docs/en/s3ap-throughput-benchmark.md
- Vendor Comparison: docs/en/vendor-comparison.md
- Partner FAQ: docs/en/partner-faq.md
- Workshop Guide: docs/en/workshop-hands-on-half-day.md
- Compliance Evidence Pack: docs/en/compliance-evidence-pack.md
Series Navigation
- Part 1: Why Your FSx for ONTAP Audit Logs Deserve Better Than EC2
- Part 2: Shipping FSx for ONTAP Logs to Datadog — The Serverless Way
- Part 3: Event-Driven Ransomware Detection with ONTAP ARP + Datadog
- Part 4: FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate
- Part 5: Escape Vendor Lock-in: Multi-Backend Log Delivery with OTel Collector for FSx for ONTAP
- Part 6: Direct-to-Grafana: Shipping FSx for ONTAP Logs to Grafana Cloud Loki via OTLP Gateway
- Part 7: Ship FSx for ONTAP Audit Logs to New Relic via Serverless Lambda Pipeline
- Part 8: EC2 to Serverless: Modernizing FSx for ONTAP Splunk Integration
- Part 9: Data Sovereignty: FSx for ONTAP Logs in Your VPC with Elastic
- Part 10: High-Cardinality File Access Analysis with Honeycomb + OTel
- Part 11: AI-Powered Root Cause: Correlating File Access with APM via Dynatrace
- Part 12: FSx for ONTAP Audit Logs with Data Residency in your region with Sumo Logic
- Part 13: 9 Vendors, One Architecture (this post)
Thank you for following this series. If you've deployed any of these integrations, I'd love to hear about your experience — drop a comment or open a GitHub issue.
GitHub: github.com/Yoshiki0705/fsxn-observability-integrations