惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

酷 壳 – CoolShell
酷 壳 – CoolShell
H
Hacker News: Front Page
P
Palo Alto Networks Blog
T
ThreatConnect
Apple Machine Learning Research
Apple Machine Learning Research
博客园_首页
T
True Tiger Recordings
P
Privacy & Cybersecurity Law Blog
B
Blog
IT之家
IT之家
Last Week in AI
Last Week in AI
F
Full Disclosure
Hacker News: Ask HN
Hacker News: Ask HN
C
Comments on: Blog
Microsoft Azure Blog
Microsoft Azure Blog
C
Cybersecurity and Infrastructure Security Agency CISA
Microsoft Security Blog
Microsoft Security Blog
博客园 - 【当耐特】
N
News and Events Feed by Topic
NISL@THU
NISL@THU
腾讯CDC
雷峰网
雷峰网
Security Latest
Security Latest
李成银的技术随笔
M
Microsoft Research Blog - Microsoft Research
L
LangChain Blog
L
Lohrmann on Cybersecurity
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
C
Check Point Blog
Y
Y Combinator Blog
Recent Announcements
Recent Announcements
博客园 - Franky
N
News | PayPal Newsroom
V
V2EX
A
About on SuperTechFans
The Register - Security
The Register - Security
月光博客
月光博客
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
Google Online Security Blog
Google Online Security Blog
MyScale Blog
MyScale Blog
Cisco Talos Blog
Cisco Talos Blog
Vercel News
Vercel News
WordPress大学
WordPress大学
C
Cyber Attacks, Cyber Crime and Cyber Security
The Hacker News
The Hacker News
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
爱范儿
爱范儿
A
Arctic Wolf
L
LINUX DO - 最新话题
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More

Datadog | The Monitor blog

Reduce CVE noise with OpenVEX assessments in Datadog How we made a SQL query optimization agent 59% more accurate using autoresearch and LLM Observability How to audit and clean up monitors effectively Diagnose slow PostgreSQL queries faster with explain plan correlation Explore Datadog metrics with Natural Language Queries Toto 2.0: Time series forecasting enters the scaling era Simplify micro-frontend observability with Datadog RUM Attribute AI costs across providers with Datadog Cloud Cost Management Diagnose and resolve database performance issues faster with Database Investigator Datadog for Government achieves FedRAMP® High certification Analyze cloud costs with flexible spreadsheets in Datadog Sheets Inside Datadog’s AI Research Lab: Meet two PhD candidates behind Toto Connect triage and investigation in a single workflow with Datadog Cloud SIEM This Month in Datadog - April 2026 Monitor and optimize Supabase query performance with Datadog Database Monitoring Add dynamically updating context to logs with Reference Tables and Observability Pipelines Introducing ARFBench: A time series question-answering benchmark based on real incidents The product signal latency gap slowing your growth Test network paths with TCP, UDP, and ICMP in Datadog Turn developer feedback into operational insight with Datadog Forms and Sheets How to investigate cloud credential compromise with Bits AI Security Analyst Evaluate, optimize, and secure your Google Cloud AI stack with Datadog Bringing observability data hosting to the UK on AWS Identify and fix code issues faster with Datadog’s Azure DevOps Source Code integration Steganography at scale: Embedding share URLs in Datadog widget screenshots Every team should be A/B testing Centralize observability management with Datadog Governance Console Spotting CI/CD misconfigurations before the bots do: Securing GitHub Actions with Datadog IaC Security Route OTel data from AI apps to ClickHouse and Datadog using Observability Pipelines Manage service tracing across hosts with Single Step Instrumentation rules Offline evaluation for AI agents: Best practices Detect runtime threats in Python Lambda functions with Datadog AAP Introducing our open source AI-native SAST Instrument and monitor Boomi integration flows with OpenTelemetry and Datadog Not all index scans are equal: How we cut query latency by over 99% Platform engineering metrics: What to measure and what to ignore Integrate Recorded Future threat intelligence with Datadog Cloud SIEM CI/CD security: threat modeling using a MITRE-style threat matrix CI/CD security: How to secure your GitHub ecosystem Ingress NGINX is EOL: A practical guide for migrating to Kubernetes Gateway API How we built a real-world evaluation platform for autonomous SRE agents at scale Operating agentic AI with Amazon Bedrock AgentCore and Datadog LLM Observability: Lessons from NTT DATA Introducing the Datadog Code Security MCP Capture and analyze custom heatmaps in Session Replay Understand session replays faster with AI summaries and smart chapters Monitor ClickHouse query performance with Datadog Database Monitoring How we designed empathetic alert sounds for on-call engineers Search and act across Datadog to resolve issues faster with Bits Assistant Measure the business impact of every product change with Datadog Experiments Analyzing round trip query latency Configuring JavaScript caches for better performance Introducing Bits AI Dev Agent for Code Security Datadog achieves ISO 42001 certification for responsible AI Monitor Nutanix clusters, hosts, and VMs with Datadog Monitor Juniper Mist in Datadog A new Host Map for modern infrastructure When upserts don't update but still write: Debugging Postgres performance at scale Annotate traces to improve LLM quality with Datadog LLM Observability What's new in Cloud SIEM: AI-powered investigations, enhanced threat intelligence, and scalable security operations Explore Kubernetes with native OpenTelemetry data Monitor Oracle Fusion Cloud Applications with Datadog Announcing the Datadog Terraform provider v4.0.0 Scaling Kubernetes workloads on custom metrics How to design cloud environments for AI-powered threat analysis Monitor Aruba Central in Datadog How we centralize and remediate risks with Datadog Case Management Accelerate incident response with Datadog and ServiceNow Monitor your application and network load balancer logs Understanding Karpenter architecture for Kubernetes autoscaling Tools for collecting metrics and logs from Karpenter Monitor Karpenter with Datadog What your product data is actually saying Key metrics for monitoring Karpenter Securing Datadog's platform in the AI age: The role of observability data Closing the verification loop: Observability-driven harnesses for building with agents When an AI agent came knocking: Catching malicious contributions in Datadog’s open source repos Closing the verification loop, Part 2: Fully autonomous optimization Four ways engineering teams use the Datadog MCP Server to power AI agents Approaching your observability migration with the right mindset Meet the new Bits AI SRE: Deeper reasoning, twice as fast Designing MCP tools for agents: Lessons from building Datadog's MCP server Key learnings from the 2026 State of DevSecOps study Use plain English to query your multi-cloud infrastructure in Resource Catalog Simplifying troubleshooting across the user journey with Datadog Synthetic Monitoring Protect your OCI resources with Datadog Cloud Security This Month in Datadog - February 2026 Fine-tune Toto for turbocharged forecasts Amazon EC2 security: How misconfigured and public AMIs expand your cloud attack surface Enable end-to-end visibility into your Java apps with a single command Measure and improve mobile app startup performance with Datadog RUM Evaluating our AI Guard application to improve quality and control cost Identify untested code across every level of your codebase Make use of guardrail metrics and stop babysitting your releases Monitor Versa Networks SD-WAN performance in Datadog How we reduced the size of our Agent Go binaries by up to 77% Improve performance and reliability with APM Recommendations Remediate transitive vulnerabilities faster with Datadog Software Composition Analysis Generate audit-ready vulnerability and compliance reports with Datadog Sheets Monitor Fortinet FortiManager performance in Datadog Improve test coverage across codebases with Datadog Code Coverage
Build secure and scalable Azure serverless applications with the Well-Architected Framework
2025-08-12 · via Datadog | The Monitor blog
Jordan Obey

Jordan Obey

Serverless platforms like Azure Functions and Azure Container Apps make it easier to scale your applications without managing infrastructure. But successful serverless apps require thoughtful planning. They must be designed to account for cold starts, unpredictable scaling behavior, and ephemeral compute lifecycles, all while ensuring secure data handling and end-to-end observability across highly distributed components.

The Azure Well-Architected Framework provides five pillars to guide this design process. In this post, we’ll explore how to build more robust serverless solutions by applying its principles to common architectural challenges.

We’ll also show you how Datadog supports each of these pillars by providing the insights needed to ensure that your serverless applications are secure, cost-efficient, and behaving as expected.

Reliability

In a serverless context, reliability is particularly important because you’re relying on several managed services that scale independently and operate on demand. If not properly isolated, outages or performance issues in any one component—whether it’s your API layer, messaging queue, or background function—can ripple through your application.

This section covers how to achieve reliability within your serverless workloads by ensuring high availability and managing function failures gracefully.

Ensure high availability

High availability refers to an application’s ability to remain operational even when part of the system fails. Azure Functions are hosted in multiple regions and availability zones by default, offering built-in high availability. However, this default behavior can be affected when you integrate other components—such as databases, queues, or virtual networks—which may require manual configuration.

For instance, if you deploy Azure Functions inside a virtual network (VNet), your app is placed in a single availability zone within that VNet’s region by default. This means that if that zone experiences an outage, your Functions app may become unavailable, even if the rest of the region remains healthy. To mitigate this risk, you should provision subnets across multiple availability zones and configure your infrastructure accordingly to preserve high availability and prevent a single point of failure from impacting your entire application.

As another example, Azure also recommends enabling multi-region writes in Azure Cosmos DB to keep your database accessible if a region becomes unavailable.

Manage function failures gracefully

Even with high availability, some functions may still fail due to unexpected errors, unhandled exceptions, or service timeouts. You’ll need to account for these failures and ensure that services retry safely and consistently.

For instance, Azure Functions triggered by queues or event streams automatically retry when a failure occurs. This means they may be invoked multiple times for the same event, unintentionally creating duplicate emails or repeated charges. To avoid this issue, ensure that functions are idempotent, meaning they can run multiple times with the same input and still produce the same result. For example, a checkout function should verify whether a payment has already been processed before submitting it again.

To handle persistent failures, you can configure a dead-letter queue (DLQ) to capture messages that have failed after a set number of retries. This prevents them from being retried indefinitely and allows for later analysis, remediation, or reprocessing. DLQs are especially critical when dealing with external systems or non-idempotent operations.

For functions that trigger synchronously (such as HTTP APIs), you’ll need to build retry logic into the calling service. Be sure that your function returns appropriate status codes so consumers can distinguish between client-side and server-side errors. Azure API Management or Application Gateway can help route and shape this traffic, but the function’s runtime code should still handle errors explicitly.

Performance

Serverless platforms like Azure are designed to scale quickly and handle bursts of traffic, but that doesn’t automatically guarantee fast response times. Cold starts, inefficient logic, and resource misconfiguration can still slow down your workloads, especially when usage is unpredictable and prone to spikes. The performance pillar focuses on using computing resources efficiently to deliver responsive user experiences at any scale.

Azure provides several tools to help serverless workloads perform well under pressure, including autoscaling, caching, and message queues. In the sections below, we’ll cover how to get the most out of these features and ensure that your Azure-hosted serverless workloads can scale and respond quickly to requests.

Minimize cold starts

Cold starts occur when a function or container has to initialize from scratch before responding to a request. This added startup time can significantly increase response latency, which is particularly detrimental for latency-sensitive services like login, search, or checkout APIs.

If you are using Azure Container Apps, you can reduce cold start latency by configuring a minimum replica count when you set scaling rules. This ensures that some containers remain warm and ready to respond immediately, even during low-traffic periods. For example, if you run an ecommerce API that sees unpredictable spikes during a sale or marketing campaign, configuring a minimum of one or two pre-warmed replicas helps absorb sudden traffic without forcing the platform to start from zero. A YAML-based configuration might include a scale block that looks like the following:

scale:

minReplicas: 2

maxReplicas: 10

rules:

- name: http-rule

http:

metadata:

concurrentRequests: "50"

The scale block defines how your container app adjusts its number of replicas based on incoming traffic and other conditions. In the scale block above, minReplicas: 2 keeps two containers running at all times to reduce cold start latency, while maxReplicas: 10 allows the app to scale up when needed. The http-rule tells Azure to add more replicas when any instance handles more than 50 concurrent HTTP requests.

If you’re using Azure Functions instead of Azure Container Apps, you can mitigate cold starts by running your functions on the Premium plan, which offers always ready and prewarmed instances, or by enabling provisioned concurrency. The Premium plan is well suited for APIs and interactive endpoints that require consistent performance. For example, a customer authentication service that runs on the Premium plan can avoid the latency spikes that might otherwise frustrate users logging in during peak hours. You may also consider running workloads with strict latency requirements in Azure Container Apps or long-running containers, where you can eliminate cold starts entirely and fine-tune performance at the runtime level.

Process functions efficiently

In serverless environments, where you’re billed based on execution time and resource consumption, streamlining your functions has both performance and cost benefits. Keep serverless functions efficient by scoping each one to a specific task. Scoping your functions to a single task simplifies debugging, improves cold start times, and makes scaling more predictable. You can also use monitoring tools like Datadog to get insight into metrics like memory usage, execution duration, concurrent invocations, and error rates, to help you ensure that each function is right-sized and tuned for its workload.

For example, it often makes sense to group related tasks—like validating input and saving data to a database—into a single function because they belong to the same logical unit of work. An action like sending a confirmation email, however, can be separated into its own function and triggered asynchronously. This keeps the critical path simple and reliable while allowing side effects to scale and fail independently.

Use batching and caching when possible

Caching and batching are effective ways to reduce latency and improve the performance of serverless applications. Caching allows your functions to quickly retrieve frequently used data—such as product details, user profiles, or configuration settings—without needing to repeatedly query a backend service. You can use Azure Cache for Redis for durable, distributed caching, or in-memory caching for functions running on the Premium plan with Always On enabled. This helps reduce the impact of cold starts and offloads pressure from databases and APIs.

Batching improves efficiency by grouping multiple data points or messages into a single function execution. For example, instead of writing each event or log entry individually, you can collect them into batches and send them to Cosmos DB or an external service in one operation. This reduces invocation overhead, lowers compute time, and helps you stay within throughput limits for downstream systems. Together, caching and batching help your application scale more smoothly and cost-effectively under load.

Security

Serverless architectures rely on a distributed system of managed services, each introducing its own configuration and potential exposure. Because serverless components often communicate over event-driven or asynchronous channels, the attack surface of serverless workloads can be difficult to define and even more challenging to monitor. For even more information about security best practices for serverless technologies, you can check out the Open Worldwide Application Security Project’s (OWASP) Top 10 recommendations.

Apply least privilege access with RBAC

Azure Role-Based Access Control (RBAC) enables you to define exactly which services and users can perform specific actions. Avoid granting broad default permissions to your functions or containerized workloads. Instead, assign roles that limit access to only the resources and operations that are required.

For example, a function that writes to a single Cosmos DB container should not be granted full database permissions. Instead, scope its identity to only that specific container and operation using managed identity and custom role definition. This limits the blast radius of a compromised function and helps enforce clear boundaries between components.

Authenticate and authorize every interaction

Each service-to-service call within your serverless application should be authenticated and authorized. Use Azure Managed Identities to enable secure authentication between functions, queues, and databases without managing credentials directly. When combined with RBAC, this allows for seamless yet secure internal communication.

For instance, an Azure Function that processes orders and writes to Cosmos DB can authenticate using its system-assigned managed identity. Cosmos DB can then validate the request using Microsoft Entra ID (formerly known as Azure Active Directory), eliminating the need for hardcoded credentials.

Encrypt data and secure network access

Protect sensitive data both at rest and in transit. Azure services like Blob Storage, Cosmos DB, and Service Bus support built-in encryption using Microsoft-managed or customer-managed keys. Enable these features consistently across your application’s data layer. In addition, restrict public exposure wherever possible. Use private endpoints to isolate traffic between Azure services within a virtual network. For containerized workloads in Azure Container Apps, you can configure environment-level network restrictions to ensure that only internal services can trigger certain APIs or send messages to a queue.

Cost optimization

Serverless environments offer a pay-as-you-go model, which is great for scaling workloads efficiently, but only if your usage is intentional and monitored. In Azure, misconfigured services, overprovisioned resources, or overlooked usage patterns can lead to unexpected costs.

Optimizing cost means selecting the right plans, scaling with precision, minimizing idle time, and using automation to right-size workloads. It also means aligning spending with business outcomes: not all savings are worth the trade-off in performance or reliability.

Azure’s cost analysis tools, autoscaling features, and workload-specific recommendations can help keep your app high-performing and affordable. By continuously measuring and adjusting usage patterns, you can ensure you’re spending wisely while delivering a smooth experience.

Keep functions lean

Function execution time directly impacts cost in a pay-as-you-go model. Slow functions not only delay user responses but also cost more to run. We discussed optimizing function performance earlier in this post, but a good rule of thumb to keep in mind is that reducing latency reduces cost.

To spot optimization opportunities, take advantage of profiling and tracing tools. For example, Azure Application Insights and distributed tracing with OpenTelemetry can help identify where functions are spending the most time—whether it’s on cold starts, external dependencies, or inefficient code paths.

For example, you might notice that a function’s execution time drops significantly after switching from a database lookup to a cache query. That performance gain directly lowers compute charges, especially at scale.

Avoid idle execution

Idle time can be a hidden cost in serverless apps, particularly in long-running workflows. If a function is waiting on I/O, holding a connection, or paused during a workflow, you’re still billed for the time it’s running.

Refactor long-running operations into shorter, event-driven steps, or use Durable Functions to offload state management. Durable Functions pause without consuming compute, making them a cost-effective choice for background tasks and multi-step workflows.

Choose the right pricing plan

Azure offers several hosting plans for functions. The Consumption plan is ideal for spiky or low-volume workloads since it charges strictly based on the number of executions. The Premium plan is better for high-throughput or latency-sensitive APIs, as it supports features like VNet integration, Always On, and prewarmed instances. However, the Premium plan comes at a higher baseline cost since you’re billed for provisioned instances even when idle.

Finally, the App Service plan runs Azure Functions on dedicated infrastructure shared with App Services. It offers predictable pricing with no per-execution charges but lacks autoscaling and can lead to underutilization if traffic is inconsistent. Choosing the right plan depends on your workload’s traffic patterns, latency requirements, and cost sensitivity.

Autoscale with intent

Autoscaling is one of the main advantages of serverless, but it needs to be configured thoughtfully. Set sensible minimum and maximum instance counts for Azure Container Apps to avoid unnecessary idle replicas.

For Cosmos DB, monitor request unit (RU/s) usage and enable autoscale throughput so that you’re not paying for capacity you don’t use. For example, if your Cosmos DB container serves traffic that spikes during lunch hours and drops in the evening, autoscale will automatically increase and decrease throughput limits, keeping performance stable while avoiding waste.

Operational excellence

Operational excellence means having the visibility you need in order to detect problems quickly, resolve them efficiently, and continuously improve how your system runs. Because serverless environments are dynamic and event-driven, traditional uptime monitoring isn’t enough. You need telemetry across metrics, traces, logs, and deployments to gain insight into application health and performance.

The Azure Well-Architected Framework encourages teams to build observability into every layer of their applications. This includes tracking error rates, cold starts, queue delays, and external dependencies. Datadog supports this approach by providing deep visibility into Azure serverless environments through integrated monitoring, tracing, and security tools.

Monitor workloads with Datadog’s Azure integration

Datadog’s Azure integration collects and visualizes metrics, traces, and logs from Azure Functions, Azure Container Apps, and other supporting services, providing a complete overview into your entire Azure infrastructure.

Screenshot of the Azure Overview Dashboard

With this integration, you can set up out-of-the-box dashboards to monitor key metrics like cold start durations, function error rates, queue backlogs, and API latency. You can also set alerts to notify teams of any anomalies that have been detected or whether a specified threshold has been surpassed. For instance, if your checkout API starts experiencing latency due to increased load, you’ll be notified before it affects the user experience.

End-to-end tracing with Datadog APM

Datadog APM supports native distributed tracing for Azure Functions and Azure Container Apps, and automatically connects those traces to other Azure-managed services in your environment—such as Cosmos DB, Service Bus, and API Management. This gives you a complete picture of each request as it propagates across your serverless application, so you can identify slow components, errors, or retries at a glance.

Screenshot of Azure end-to-end tracing

Datadog also enriches trace spans with metadata from Azure Function invocations, including cold start indicators, memory usage, and trigger context. You can search, filter, and group spans by function name, request ID, or custom tags, making it easier to understand how individual services contribute to performance or failure. Whether you’re debugging a failed checkout flow or tracing queue latency during a traffic spike, Datadog helps you pinpoint the issue and take action quickly.

Tie together logs, metrics, and traces for faster troubleshooting

Datadog automatically links metrics, logs, and distributed traces from across your system. When a function fails, you can trace the request that triggered it, see where delays occurred, and view logs for that specific invocation—all within Datadog.

Screenshot of Azure logs

For example, if a checkout failure occurs in an Azure Function, you can use Datadog to show you the full trace, including upstream queue delays, memory usage, or Cosmos DB throttling that led to the failure. This correlation accelerates root cause analysis and reduces mean time to resolution (MTTR).

Securing your Azure environment with Datadog Cloud SIEM

Datadog Cloud SIEM (Security Information and Event Management) enhances your Azure environment’s security posture by applying out-of-the-box detection rules to all ingested logs and events to detect potential threats such as unauthorized access or communications with malicious IP addresses.

Datadog automatically surfaces suspicious activity as Security Signals, which are visible in the Security Signals Explorer. From there, teams can quickly triage and investigate threats using correlated log data and built-in context.

Screenshot of Azure Cloud SIEM

By combining telemetry and security in a single platform, Datadog gives you a unified, contextualized view of your Azure ecosystem, helping you achieve operational excellence with confidence and agility.

Supporting all five pillars with Datadog

Building serverless applications on Azure requires careful design decisions across reliability, performance, security, cost, and operations. The Azure Well-Architected Framework provides a strong foundation for making those decisions, and with the right tools, you can put those principles into practice. By decoupling services, managing cold starts, enforcing least-privilege access, and continuously monitoring your stack, you can deliver serverless systems that are both scalable and resilient.

Datadog gives you the insight you need to implement these best practices confidently. From real-time performance metrics and traces to CI/CD insights and cost-aware dashboards, Datadog brings full visibility to every part of your Azure serverless architecture. To see how Datadog can help you build and operate well-architected serverless applications, start a 14-day free trial.