Build secure and scalable Azure serverless applications with the Well-Architected Framework

Datadog | The Monitor blog


2025-08-12 · via Datadog | The Monitor blog

Jordan Obey

Serverless platforms like Azure Functions and Azure Container Apps make it easier to scale your applications without managing infrastructure. But successful serverless apps require thoughtful planning. They must be designed to account for cold starts, unpredictable scaling behavior, and ephemeral compute lifecycles, all while ensuring secure data handling and end-to-end observability across highly distributed components.

The Azure Well-Architected Framework provides five pillars to guide this design process. In this post, we’ll explore how to build more robust serverless solutions by applying its principles to common architectural challenges.

Reliability
Performance
Security
Cost optimization
Operational excellence

We’ll also show you how Datadog supports each of these pillars by providing the insights needed to ensure that your serverless applications are secure, cost-efficient, and behaving as expected.

Reliability

In a serverless context, reliability is particularly important because you’re relying on several managed services that scale independently and operate on demand. If not properly isolated, outages or performance issues in any one component—whether it’s your API layer, messaging queue, or background function—can ripple through your application.

This section covers how to achieve reliability within your serverless workloads by ensuring high availability and managing function failures gracefully.

Ensure high availability

High availability refers to an application’s ability to remain operational even when part of the system fails. Azure Functions runs on managed infrastructure that handles instance failures for you, and some hosting plans can additionally spread instances across availability zones. However, that availability can be affected when you integrate other components (such as databases, queues, or virtual networks), which may require manual configuration.

For instance, if you deploy Azure Functions inside a virtual network (VNet), your app is placed in a single availability zone within that VNet’s region by default. This means that if that zone experiences an outage, your Functions app may become unavailable, even if the rest of the region remains healthy. To mitigate this risk, you should provision subnets across multiple availability zones and configure your infrastructure accordingly to preserve high availability and prevent a single point of failure from impacting your entire application.

As another example, Azure also recommends enabling multi-region writes in Azure Cosmos DB to keep your database accessible if a region becomes unavailable.

Manage function failures gracefully

Even with high availability, some functions may still fail due to unexpected errors, unhandled exceptions, or service timeouts. You’ll need to account for these failures and ensure that services retry safely and consistently.

For instance, Azure Functions triggered by queues or event streams automatically retry when a failure occurs. This means they may be invoked multiple times for the same event, unintentionally creating duplicate emails or repeated charges. To avoid this issue, ensure that functions are idempotent, meaning they can run multiple times with the same input and still produce the same result. For example, a checkout function should verify whether a payment has already been processed before submitting it again.
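
The checkout example above can be sketched in a few lines. This is a minimal illustration, not Azure Functions SDK code: `PaymentLedger`, `charge_once`, and `handle_checkout_message` are hypothetical names, and the in-memory dictionary stands in for a durable store keyed by order ID.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentLedger:
    """Hypothetical in-memory stand-in for a durable store keyed by order ID."""
    processed: dict = field(default_factory=dict)

    def charge_once(self, order_id: str, amount_cents: int) -> str:
        # If this order was already charged, return the earlier receipt
        # instead of charging a second time.
        if order_id in self.processed:
            return self.processed[order_id]
        receipt = f"receipt-{order_id}-{amount_cents}"  # placeholder for a real payment call
        self.processed[order_id] = receipt
        return receipt


def handle_checkout_message(ledger: PaymentLedger, message: dict) -> str:
    """Body of a queue-triggered handler; safe to invoke repeatedly for one message."""
    return ledger.charge_once(message["order_id"], message["amount_cents"])


ledger = PaymentLedger()
msg = {"order_id": "A-1001", "amount_cents": 2599}
first = handle_checkout_message(ledger, msg)
second = handle_checkout_message(ledger, msg)  # simulated retry of the same event
```

In a real deployment, the check and the write would need to be a single atomic operation (for example, a conditional insert in the database), since two retries can run concurrently on different instances.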

To handle persistent failures, you can configure a dead-letter queue (DLQ) to capture messages that have failed after a set number of retries. This prevents them from being retried indefinitely and allows for later analysis, remediation, or reprocessing. DLQs are especially critical when dealing with external systems or non-idempotent operations.
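
As a rough sketch of the dead-letter pattern, the loop below retries a failing message a fixed number of times and then moves it aside. Managed queues typically implement this for you through a maximum delivery count setting; the names here (`drain`, `MAX_DELIVERY_COUNT`, `flaky_handler`) and the in-memory queues are assumptions made purely for illustration.

```python
from collections import deque
from typing import Callable

MAX_DELIVERY_COUNT = 3  # assumed retry limit for this sketch


def drain(queue: deque, dead_letters: list, handler: Callable[[dict], None]) -> None:
    """Process messages until the queue is empty, dead-lettering repeated failures."""
    while queue:
        message = queue.popleft()
        message["delivery_count"] = message.get("delivery_count", 0) + 1
        try:
            handler(message)
        except Exception as exc:
            if message["delivery_count"] >= MAX_DELIVERY_COUNT:
                # Stop retrying; keep the message and the reason for later analysis.
                message["dead_letter_reason"] = repr(exc)
                dead_letters.append(message)
            else:
                queue.append(message)  # put it back for another attempt


def flaky_handler(message: dict) -> None:
    """Hypothetical handler that always fails on one malformed payload."""
    if message["body"] == "bad":
        raise ValueError("cannot parse payload")


main_queue = deque([{"body": "ok"}, {"body": "bad"}])
dlq: list = []
drain(main_queue, dlq, flaky_handler)
```

After the run, the malformed message sits in `dlq` with its delivery count and failure reason attached, while the healthy message has been processed once.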

For functions that trigger synchronously (such as HTTP APIs), you’ll need to build retry logic into the calling service. Be sure that your function returns appropriate status codes so consumers can distinguish between client-side and server-side errors. Azure API Management or Application Gateway can help route and shape this traffic, but the function’s runtime code should still handle errors explicitly.
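
One way a calling service might use those status codes is sketched below: it retries server-side and throttling responses with exponential backoff and jitter, and returns client errors immediately. The function name, the retryable set, and the delay values are assumptions for illustration, and `send` stands in for whatever HTTP client call your service makes.

```python
import random
import time
from typing import Callable, Tuple

# Server-side failures worth retrying. 429 is a client-class code, but it
# signals throttling, so this sketch treats it as retryable too.
RETRYABLE = {429, 500, 502, 503, 504}


def call_with_retries(
    send: Callable[[], int],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Call send() until it returns a non-retryable status or attempts run out.

    Returns (final_status, attempts_used).
    """
    status, attempt = 0, 0
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status not in RETRYABLE:
            break
        if attempt < max_attempts:
            # Exponential backoff with full jitter to avoid synchronized retries.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    return status, attempt


# Simulated responses: two transient failures, then success.
responses = iter([503, 503, 200])
recovered = call_with_retries(lambda: next(responses), sleep=lambda _: None)

# A client error is returned immediately without retrying.
rejected = call_with_retries(lambda: 400, sleep=lambda _: None)
```

Because retried requests can reach the function more than once, this pattern works best when the function behind the endpoint is idempotent.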

Performance

Serverless platforms on Azure are designed to scale quickly and handle bursts of traffic, but that doesn’t automatically guarantee fast response times. Cold starts, inefficient logic, and resource misconfiguration can still slow down your workloads, especially when usage is unpredictable and prone to spikes. The performance pillar focuses on using computing resources efficiently to deliver responsive user experiences at any scale.

Azure provides several tools to help serverless workloads perform well under pressure, including autoscaling, caching, and message queues. In the sections below, we’ll cover how to get the most out of these features and ensure that your Azure-hosted serverless workloads can scale and respond quickly to requests.

Minimize cold starts

Cold starts occur when a function or container has to initialize from scratch before responding to a request. This added startup time can significantly increase response latency, which is particularly detrimental for latency-sensitive services like login, search, or checkout APIs.

If you are using Azure Container Apps, you can reduce cold start latency by configuring a minimum replica count when you set scaling rules. This ensures that some containers remain warm and ready to respond immediately, even during low-traffic periods. For example, if you run an ecommerce API that sees unpredictable spikes during a sale or marketing campaign, configuring a minimum of one or two pre-warmed replicas helps absorb sudden traffic without forcing the platform to start from zero. A YAML-based configuration might include a scale block that looks like the following:

scale:
  minReplicas: 2
  maxReplicas: 10
  rules:
    - name: http-rule
      http:
        metadata:
          concurrentRequests: "50"

The scale block defines how your container app adjusts its number of replicas based on incoming traffic and other conditions. In the scale block above, minReplicas: 2 keeps two containers running at all times to reduce cold start latency, while maxReplicas: 10 allows the app to scale up when needed. The http-rule tells Azure to add more replicas when any instance handles more than 50 concurrent HTTP requests.
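
To build intuition for how these three settings interact, the following sketch models the replica count as total concurrent requests divided by the per-replica target, clamped to the configured bounds. This is a simplified model under that assumption, not the platform’s actual scaling algorithm, which also applies polling intervals and cool-down behavior; `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(
    concurrent_requests: int,
    target_per_replica: int = 50,  # mirrors concurrentRequests: "50"
    min_replicas: int = 2,         # mirrors minReplicas: 2
    max_replicas: int = 10,        # mirrors maxReplicas: 10
) -> int:
    """Simplified estimate of how many replicas the scale block would settle on."""
    needed = math.ceil(concurrent_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, needed))


idle = desired_replicas(0)        # no traffic: held at the minimum
moderate = desired_replicas(120)  # needs three replicas at 50 requests each
flood = desired_replicas(1000)    # capped at the maximum
```

The clamp is the part to pay attention to: the minimum sets your baseline cost, and the maximum sets how much load the app can absorb before requests start queuing.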

If you’re using Azure Functions instead of Azure Container Apps, you can mitigate cold starts by running your functions on the Premium plan, which offers always ready and prewarmed instances, or on the Flex Consumption plan, which lets you configure always ready instances. The Premium plan is well suited for APIs and interactive endpoints that require consistent performance. For example, a customer authentication service that runs on the Premium plan can avoid the latency spikes that might otherwise frustrate users logging in during peak hours. You may also consider running workloads with strict latency requirements in Azure Container Apps or long-running containers, where you can eliminate cold starts entirely and fine-tune performance at the runtime level.

Process functions efficiently

In serverless environments, where you’re billed based on execution time and resource consumption, streamlining your functions has both performance and cost benefits. Keep serverless functions efficient by scoping each one to a specific task. Scoping your functions to a single task simplifies debugging, improves cold start times, and makes scaling more predictable. You can also use monitoring tools like Datadog to get insight into metrics like memory usage, execution duration, concurrent invocations, and error rates, to help you ensure that each function is right-sized and tuned for its workload.

For example, it often makes sense to group related tasks—like validating input and saving data to a database—into a single function because they belong to the same logical unit of work. An action like sending a confirmation email, however, can be separated into its own function and triggered asynchronously. This keeps the critical path simple and reliable while allowing side effects to scale and fail independently.

Use batching and caching when possible

Caching and batching are effective ways to reduce latency and improve the performance of serverless applications. Caching allows your functions to quickly retrieve frequently used data (such as product details, user profiles, or configuration settings) without needing to repeatedly query a backend service. You can use Azure Cache for Redis for shared, distributed caching, or in-memory caching for functions that stay warm, such as those running on the Premium plan with always ready instances. This helps reduce the impact of cold starts and offloads pressure from databases and APIs.
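
A minimal in-memory cache with a time-to-live might look like the sketch below. `TTLCache` and `load_product` are hypothetical names, the 60-second TTL is arbitrary, and the injectable clock exists only so the example can simulate expiry without waiting.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Tiny per-instance cache; entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key: str, loader: Callable[[str], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: skip the backend entirely
        value = loader(key)  # miss or expired: query the backend once
        self._entries[key] = (now + self._ttl, value)
        return value


backend_calls = []


def load_product(product_id: str) -> dict:
    """Hypothetical backend lookup; records each call so misses can be counted."""
    backend_calls.append(product_id)
    return {"id": product_id, "name": "example"}


fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.get_or_load("sku-1", load_product)  # miss: loads from the backend
cache.get_or_load("sku-1", load_product)  # hit: served from memory
fake_now[0] = 61.0
cache.get_or_load("sku-1", load_product)  # expired: loads again
```

A cache like this lives only as long as the instance that holds it, so each new or cold-started instance begins empty. That is the main reason to prefer a shared cache when many instances need the same data.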

Batching improves efficiency by grouping multiple data points or messages into a single function execution. For example, instead of writing each event or log entry individually, you can collect them into batches and send them to Cosmos DB or an external service in one operation. This reduces invocation overhead, lowers compute time, and helps you stay within throughput limits for downstream systems. Together, caching and batching help your application scale more smoothly and cost-effectively under load.
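
The batching idea can be sketched with a small chunking helper. The batch size of 100 and the `write_batch` function are assumptions for illustration; in practice the size would be chosen to fit the downstream service’s limits.

```python
from typing import Iterable, Iterator, List


def batched(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


write_calls = []


def write_batch(batch: List[dict]) -> None:
    """Hypothetical stand-in for one bulk write to a database or external service."""
    write_calls.append(len(batch))


events = [{"event_id": i} for i in range(250)]
for chunk in batched(events, 100):
    write_batch(chunk)
```

Here 250 events result in three downstream calls instead of 250, which is where the savings in invocation overhead and throughput come from.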

Security

Serverless architectures rely on a distributed system of managed services, each introducing its own configuration and potential exposure. Because serverless components often communicate over event-driven or asynchronous channels, the attack surface of serverless workloads can be difficult to define and even more challenging to monitor. For more information about security best practices for serverless technologies, see the Open Worldwide Application Security Project’s (OWASP) Top 10 recommendations.

Apply least privilege access with RBAC

Azure Role-Based Access Control (RBAC) enables you to define exactly which services and users can perform specific actions. Avoid granting broad default permissions to your functions or containerized workloads. Instead, assign roles that limit access to only the resources and operations that are required.

For example, a function that writes to a single Cosmos DB container should not be granted full database permissions. Instead, scope its identity to only that specific container and operation using a managed identity and a custom role definition. This limits the blast radius of a compromised function and helps enforce clear boundaries between components.
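
To make the idea of a narrowly scoped role concrete, the sketch below builds a custom role definition that allows creating items and reading account metadata, and nothing else. The JSON shape and action names reflect our understanding of Cosmos DB data-plane role definitions and should be verified against the current Azure documentation before use; `container_write_role`, `container_scope`, and the role, database, and container names are hypothetical.

```python
import json


def container_write_role(role_name: str) -> dict:
    """Build a write-only custom role definition (shape assumed, verify before use)."""
    return {
        "RoleName": role_name,
        "Type": "CustomRole",
        "AssignableScopes": ["/"],
        "Permissions": [
            {
                "DataActions": [
                    # Needed by SDK clients to discover the account layout.
                    "Microsoft.DocumentDB/databaseAccounts/readMetadata",
                    # The single data operation this function requires.
                    "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/items/create",
                ]
            }
        ],
    }


def container_scope(database: str, container: str) -> str:
    """Scope string that limits a role assignment to a single container."""
    return f"/dbs/{database}/colls/{container}"


role = container_write_role("OrdersWriter")
scope = container_scope("shop", "orders")
role_json = json.dumps(role, indent=2)
```

The point of the exercise is what the definition leaves out: no wildcard actions, no read or delete permissions, and an assignment scope that ends at one container.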

Authenticate and authorize every interaction

Each service-to-service call within your serverless application should be authenticated and authorized. Use Azure Managed Identities to enable secure authentication between functions, queues, and databases without managing credentials directly. When combined with RBAC, this allows for seamless yet secure internal communication.

For instance, an Azure Function that processes orders and writes to Cosmos DB can authenticate using its system-assigned managed identity. Cosmos DB can then validate the request using Microsoft Entra ID (formerly known as Azure Active Directory), eliminating the need for hardcoded credentials.

Encrypt data and secure network access

Protect sensitive data both at rest and in transit. Azure services like Blob Storage, Cosmos DB, and Service Bus support built-in encryption using Microsoft-managed or customer-managed keys. Enable these features consistently across your application’s data layer. In addition, restrict public exposure wherever possible. Use private endpoints to isolate traffic between Azure services within a virtual network. For containerized workloads in Azure Container Apps, you can configure environment-level network restrictions to ensure that only internal services can trigger certain APIs or send messages to a queue.

Cost optimization

Serverless environments offer a pay-as-you-go model, which is great for scaling workloads efficiently, but only if your usage is intentional and monitored. In Azure, misconfigured services, overprovisioned resources, or overlooked usage patterns can lead to unexpected costs.

Optimizing cost means selecting the right plans, scaling with precision, minimizing idle time, and using automation to right-size workloads. It also means aligning spending with business outcomes: not all savings are worth the trade-off in performance or reliability.

Azure’s cost analysis tools, autoscaling features, and workload-specific recommendations can help keep your app high-performing and affordable. By continuously measuring and adjusting usage patterns, you can ensure you’re spending wisely while delivering a smooth experience.

Keep functions lean

Function execution time directly impacts cost in a pay-as-you-go model. Slow functions not only delay user responses but also cost more to run. We discussed optimizing function performance earlier in this post, but a good rule of thumb to keep in mind is that reducing latency reduces cost.

To spot optimization opportunities, take advantage of profiling and tracing tools. For example, Azure Application Insights and distributed tracing with OpenTelemetry can help identify where functions are spending the most time—whether it’s on cold starts, external dependencies, or inefficient code paths.

You might notice, for instance, that a function’s execution time drops significantly after switching from a database lookup to a cache query. That performance gain directly lowers compute charges, especially at scale.
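
A back-of-the-envelope model shows how latency translates into cost under consumption-style billing, where you pay per execution plus per unit of memory-time. The formula below ignores free grants and rounding rules, and the rates in the example are placeholders rather than quoted prices; check the current pricing page for real numbers.

```python
def estimate_monthly_compute_cost(
    executions: int,
    avg_duration_ms: float,
    memory_mb: int,
    price_per_million_executions: float,
    price_per_gb_second: float,
) -> float:
    """Rough cost model: execution charge plus memory-time (GB-seconds) charge."""
    gb_seconds = executions * (avg_duration_ms / 1000) * (memory_mb / 1024)
    execution_charge = executions / 1_000_000 * price_per_million_executions
    return execution_charge + gb_seconds * price_per_gb_second


# Placeholder rates for illustration only.
RATE_PER_MILLION = 0.20
RATE_PER_GB_SECOND = 0.000016

# Same traffic and memory, before and after replacing a database lookup with a cache hit.
before = estimate_monthly_compute_cost(10_000_000, 400, 512, RATE_PER_MILLION, RATE_PER_GB_SECOND)
after = estimate_monthly_compute_cost(10_000_000, 100, 512, RATE_PER_MILLION, RATE_PER_GB_SECOND)
savings_ratio = 1 - after / before
```

With these inputs the execution charge is identical in both cases, so the entire difference comes from the shorter duration.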

Avoid idle execution

Idle time can be a hidden cost in serverless apps, particularly in long-running workflows. If a function is waiting on I/O, holding a connection, or paused during a workflow, you’re still billed for the time it’s running.

Refactor long-running operations into shorter, event-driven steps, or use Durable Functions to offload state management. Durable Functions orchestrations can wait on timers, external events, or other functions without keeping an instance busy, making them a cost-effective choice for background tasks and multi-step workflows.

Choose the right pricing plan

Azure offers several hosting plans for functions. The Consumption plan is ideal for spiky or low-volume workloads since it charges only for the executions you run and the resources they consume (execution time and memory). The Premium plan is better for high-throughput or latency-sensitive APIs, as it supports features like VNet integration, always ready instances, and prewarmed instances. However, the Premium plan comes at a higher baseline cost since you’re billed for provisioned instances even when idle.

Finally, the App Service plan runs Azure Functions on dedicated infrastructure shared with App Services. It offers predictable pricing with no per-execution charges but lacks event-driven scaling (you scale manually or with autoscale rules) and can lead to underutilization if traffic is inconsistent. Choosing the right plan depends on your workload’s traffic patterns, latency requirements, and cost sensitivity.

Autoscale with intent

Autoscaling is one of the main advantages of serverless, but it needs to be configured thoughtfully. Set sensible minimum and maximum instance counts for Azure Container Apps to avoid unnecessary idle replicas.

For Cosmos DB, monitor request unit (RU/s) usage and enable autoscale throughput so that you’re not paying for capacity you don’t use. For example, if your Cosmos DB container serves traffic that spikes during lunch hours and drops in the evening, autoscale will automatically increase and decrease throughput limits, keeping performance stable while avoiding waste.
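
The effect of autoscale on the lunch-hour pattern can be approximated with a short calculation. The sketch assumes that autoscale throughput moves between 10% of the configured maximum and the maximum, and that each hour is billed at the highest throughput reached in that hour. The hourly peaks are made-up numbers, and `billed_ru_per_hour` is a hypothetical helper.

```python
from typing import List


def billed_ru_per_hour(hourly_peak_ru: List[int], max_ru: int) -> List[int]:
    """Approximate the RU/s billed each hour under autoscale."""
    floor = max_ru // 10  # autoscale does not drop below 10% of the maximum
    return [min(max_ru, max(floor, peak)) for peak in hourly_peak_ru]


MAX_RU = 4000
# Illustrative hourly peaks: quiet morning, lunch spike, quiet evening.
peaks = [200, 300, 3500, 4000, 900, 250]

autoscale_billed = billed_ru_per_hour(peaks, MAX_RU)
autoscale_total = sum(autoscale_billed)
fixed_total = MAX_RU * len(peaks)  # provisioning for the peak all day
```

Per-RU rates can differ between autoscale and standard provisioned throughput, so a fair comparison multiplies each total by its own rate. Autoscale tends to win when traffic sits well below the peak for most of the day.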

Operational excellence

Operational excellence means having the visibility you need in order to detect problems quickly, resolve them efficiently, and continuously improve how your system runs. Because serverless environments are dynamic and event-driven, traditional uptime monitoring isn’t enough. You need telemetry across metrics, traces, logs, and deployments to gain insight into application health and performance.

The Azure Well-Architected Framework encourages teams to build observability into every layer of their applications. This includes tracking error rates, cold starts, queue delays, and external dependencies. Datadog supports this approach by providing deep visibility into Azure serverless environments through integrated monitoring, tracing, and security tools.

Monitor workloads with Datadog’s Azure integration

Datadog’s Azure integration collects and visualizes metrics, traces, and logs from Azure Functions, Azure Container Apps, and other supporting services, providing a complete overview of your entire Azure infrastructure.

With this integration, you can set up out-of-the-box dashboards to monitor key metrics like cold start durations, function error rates, queue backlogs, and API latency. You can also set alerts to notify teams when an anomaly is detected or a specified threshold is surpassed. For instance, if your checkout API starts experiencing latency due to increased load, you’ll be notified before it affects the user experience.

End-to-end tracing with Datadog APM

Datadog APM supports native distributed tracing for Azure Functions and Azure Container Apps, and automatically connects those traces to other Azure-managed services in your environment—such as Cosmos DB, Service Bus, and API Management. This gives you a complete picture of each request as it propagates across your serverless application, so you can identify slow components, errors, or retries at a glance.

Datadog also enriches trace spans with metadata from Azure Function invocations, including cold start indicators, memory usage, and trigger context. You can search, filter, and group spans by function name, request ID, or custom tags, making it easier to understand how individual services contribute to performance or failure. Whether you’re debugging a failed checkout flow or tracing queue latency during a traffic spike, Datadog helps you pinpoint the issue and take action quickly.

Tie together logs, metrics, and traces for faster troubleshooting

Datadog automatically links metrics, logs, and distributed traces from across your system. When a function fails, you can trace the request that triggered it, see where delays occurred, and view logs for that specific invocation—all within Datadog.

For example, if a checkout failure occurs in an Azure Function, you can use Datadog to show you the full trace, including upstream queue delays, memory usage, or Cosmos DB throttling that led to the failure. This correlation accelerates root cause analysis and reduces mean time to resolution (MTTR).

Securing your Azure environment with Datadog Cloud SIEM

Datadog Cloud SIEM (Security Information and Event Management) enhances your Azure environment’s security posture by applying out-of-the-box detection rules to all ingested logs and events to detect potential threats such as unauthorized access or communications with malicious IP addresses.

Datadog automatically surfaces suspicious activity as Security Signals, which are visible in the Security Signals Explorer. From there, teams can quickly triage and investigate threats using correlated log data and built-in context.

By combining telemetry and security in a single platform, Datadog gives you a unified, contextualized view of your Azure ecosystem, helping you achieve operational excellence with confidence and agility.

Supporting all five pillars with Datadog

Building serverless applications on Azure requires careful design decisions across reliability, performance, security, cost, and operations. The Azure Well-Architected Framework provides a strong foundation for making those decisions, and with the right tools, you can put those principles into practice. By decoupling services, managing cold starts, enforcing least-privilege access, and continuously monitoring your stack, you can deliver serverless systems that are both scalable and resilient.

Datadog gives you the insight you need to implement these best practices confidently. From real-time performance metrics and traces to CI/CD insights and cost-aware dashboards, Datadog brings full visibility to every part of your Azure serverless architecture. To see how Datadog can help you build and operate well-architected serverless applications, start a 14-day free trial.
