Accelerate building resiliency into systems with Cloudflare Workers

The Cloudflare Blog

The day my ping took countermeasures Announcing Claude Compliance API support with Cloudflare CASB Announcing Claude Managed Agents on Cloudflare Project Glasswing: what Mythos showed us Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse Browser Run: now running on Cloudflare Containers, it’s faster and more scalable When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug Building For The Future How Cloudflare responded to the “Copy Fail” Linux vulnerability When DNSSEC goes wrong: how we responded to the .de TLD outage Code Orange: Fail Small is complete. The result is a stronger Cloudflare network Introducing Dynamic Workflows: durable execution that follows the tenant Post-quantum encryption for Cloudflare IPsec is generally available Agents can now create Cloudflare accounts, buy domains, and deploy Shutdowns, power outages, and conflict: a review of Q1 2026 Internet disruptions Making Rust Workers reliable: panic and abort recovery in wasm‑bindgen Moving past bots vs. humans Building the agentic cloud: everything we launched during Agents Week 2026 The AI engineering stack we built internally — on the platform we ship Orchestrating AI Code Review at scale Introducing the Agent Readiness score. Check to see if your site is agent-ready Shared Dictionaries: compression that keeps up with the agentic web Redirects for AI Training enforces canonical content Unweight: how we compressed an LLM 22% without sacrificing quality Agents that remember: introducing Agent Memory Agents Week: network performance update Introducing Flagship: feature flags built for the age of AI Cloudflare’s AI Platform: an inference layer designed for agents Building the foundation for running extra-large language models AI Search: the search primitive for your agents Deploy Postgres and MySQL databases with PlanetScale + Workers Artifacts: versioned storage that speaks Git Email for agents - Cloudflare Email Service now in public beta Project Think: building the next generation of AI agents on Cloudflare Introducing Agent Lee - a new interface to the Cloudflare stack Register domains wherever you build: Cloudflare Registrar API now in beta Browser Run: give your agents a browser Rearchitecting the Workflows control plane for the agentic era Add voice to your agent Managed OAuth for Access: make internal apps agent-ready in one click Securing non-human identities: automated revocation, OAuth, and scoped permissions Scaling MCP adoption: Our reference architecture for simpler, safer and cheaper enterprise deployments of MCP Secure private networking for everyone: users, nodes, agents, Workers — introducing Cloudflare Mesh Building a CLI for all of Cloudflare Durable Objects in Dynamic Workers: Give each AI-generated app its own database Agents have their own computers with Sandboxes GA Dynamic, identity-aware, and secure Sandbox auth Welcome to Agents Week 500 Tbps of capacity: 16 years of scaling our global network From bytecode to bytes- automated magic packet generation Cloudflare targets 2029 for full post-quantum security How we built Organizations to help enterprises manage Cloudflare at scale Why we're rethinking cache for the AI era Our ongoing commitment to privacy for the 1.1.1.1 public DNS resolver Introducing EmDash — the spiritual successor to WordPress that solves plugin security Introducing Programmable Flow Protection: custom DDoS mitigation logic for Magic Transit customers Cloudflare Client-Side Security: smarter detection, now open to everyone How we use Abstract Syntax Trees (ASTs) to turn Workflows code into visual diagrams A one-line Kubernetes fix that saved 600 hours a year Sandboxing AI agents, 100x faster Inside Gen 13- how we built our most powerful server yet Launching Cloudflare’s Gen 13 servers- trading cache for cores for 2x edge compute performance Powering the agents: Workers AI now runs large models, starting with Kimi K2.5 Introducing Custom Regions for precision data control Standing up for the open Internet- why we appealed Italy’s Piracy Shield fine From legacy architecture to Cloudflare One Announcing Cloudflare Account Abuse Protection: prevent fraudulent attacks from bots and humans Slashing agent token costs by 98% with RFC 9457-compliant error responses AI Security for Apps is now generally available Building a security overview dashboard for actionable insights Investigating multi-vector attacks in Log Explorer Translating risk insights into actionable protection: leveling up security posture with Cloudflare and Mastercard Fixing request smuggling vulnerabilities in Pingora OSS deployments Active defense: introducing a stateful vulnerability scanner for APIs Complexity is a choice. SASE migrations shouldn’t take years. From the endpoint to the prompt: a unified data security vision in Cloudflare One Ending the "silent drop": how Dynamic Path MTU Discovery makes the Cloudflare One Client more resilient A QUICker SASE client: re-building Proxy Mode How Automatic Return Routing solves IP overlap Always-on detections: eliminating the WAF “log versus block” trade-off Mind the gap: new tools for continuous enforcement from boot to login Stop reacting to breaches and start preventing them with User Risk Scoring Defeating the deepfake: stopping laptop farms and insider threats Moving from license plates to badges: the Gateway Authorization Proxy Evolving Cloudflare’s Threat Intelligence Platform: actionable, scalable, and ETL-less Introducing the 2026 Cloudflare Threat Report See risk, fix risk: introducing Remediation in Cloudflare CASB How Cloudy translates complex security into human action From reactive to proactive: closing the phishing gap with LLMs Modernizing with agile SASE: a Cloudflare One blog takeover Beyond the blank slate: how Cloudflare accelerates your Zero Trust journey The truly programmable SASE platform Toxic combinations: when small signals add up to a security incident We deserve a better streams API for JavaScript The most-seen UI on the Internet? Redesigning Turnstile and Challenge Pages ASPA: making Internet routing more secure Bringing more transparency to post-quantum usage, encrypted messaging, and routing security How we rebuilt Next.js with AI in one week Cloudflare One is the first SASE offering modern post-quantum encryption across the full platform Cloudflare outage on February 20, 2026

Cloudflare Team · 2023-03-08 · via The Cloudflare Blog

In this blog post we’ll discuss how Cloudflare Workers enabled us to quickly improve the resiliency of a legacy system. In particular, we’ll discuss how we prevented the email notification systems within Cloudflare from outages caused by external vendors.

Email notification services

At Cloudflare, we send email notifications to customers such as billing invoices, password resets, OTP logins and certificate status updates. We rely on external Email Service Providers (ESPs) to deliver these emails to customers.

The following diagram shows how the system looks. Multiple services in our control plane dispatch emails through an external email vendor. They use HTTP Transmission APIs and also SMTP to send messages to the vendor. If dispatching an email fails, they are retried with exponential back-off mechanisms. Even when our ESP has outages, the retry mechanisms in place guarantee that we don’t lose any messages.

Why did we need to improve resilience?

In some cases, it isn’t sufficient to just deliver the email to the customer; it must be delivered on time. For example, OTP login emails are extremely time sensitive; their validity is short-lived such that a delay in sending them is as bad as not sending them at all. If the ESP is down, all emails including the critical ones will be delayed.

Some time ago our ESP suffered an outage; both the HTTP API and SMTP were unavailable and we couldn’t send emails to customers for a few hours. Several thousand emails couldn’t be delivered immediately and were queued. This led to a delay in clearing out the backlogged messages in the retry queue in addition to sending new messages.

Critical emails not being sent is a major incident for us. A major incident impacts our customers, our business, and also our reputation. We take serious measures to prevent it from happening again in the future. Hence improving the resiliency of this system was our top priority.

Our concrete action items to reduce the failure possibilities were to improve availability by onboarding another email vendor and, when ESP has an outage, mitigate downtime by failing over from one vendor to another.

We wanted to achieve the above as quickly as possible so that our customers are not impacted again.

Step 1 - Improve Availability

Introducing a new vendor gives us the option to failover. But before we introduced it we had to consider another key component of email deliverability, which is the email sender reputation. ISPs assign a sending score to any organization that sends emails. The higher the score, the higher the probability that the email sent by the organization reaches your inbox. Similarly the lower the score, the higher the probability it reaches your spam folder.

Adding a new vendor means emails will be delivered from a new IP address. Mailbox providers (such as Gmail or Hotmail) view new IP addresses as suspicious until they achieve a positive sending reputation. A positive sending reputation is determined by answering the question “did the recipient want this email?”. This is calculated differently for different mailbox providers, but the simplest measure would be that the recipient opened it, and perhaps clicked a link within. To build this reputation, it's recommended to use an IP warm-up strategy. This usually involves increasing e-mail volume slowly over a matter of days and weeks.

Using this new vendor “only” during failovers would increase the probability of emails reaching our customer’s spam folder for the reasons outlined above. To avoid this and establish a positive sending reputation we want to send a constant stream of traffic between the two vendors.

Solution

We came up with a weighted traffic routing module, an algorithm to route emails to the ESPs based on the ratio of weights assigned to them. This is how the algorithm works roughly:

This algorithm solves the problem of maintaining a constant stream of traffic between the two vendors and to failover when outages happen.

Our next big question was “where” to implement this module? How should this be implemented so that we can build, deploy, and test as quickly as possible? Since there are multiple teams involved, how can we make it quicker for all of them? This leads to a perfect segue to our next step.

Step 2 - Mitigate downtime

Let’s discuss a few approaches on where to build this logic. Should this just be a library? Or should it be a service in itself? Or do we have an even better way?

Design 1 - A simple solution

In this proposed solution, we would implement the weighted traffic routing module in a library and import it in every service. The pros of this approach is that there are no major changes to the overall existing architecture. However, it does come with quite a few drawbacks. Firstly, All of the services require a major code change and a considerable amount of development effort would have to be spent on this integration and testing end-to-end. Furthermore, updating libraries can be slow and there is no guarantee that teams would update in a timely manner. This means we may see teams still sending only to the old ESP during an incident, and we have not solved the problem we set out to.

The first approach is simple and doesn’t disrupt the architecture, but is not a sustainable solution in the long run. This naturally leads to our next design.

In this iteration, the weighted-traffic-routing module is extracted to its own service. This service decides where to route emails. Although it’s better than the previous approach, it is not without its drawbacks.

Firstly, the positives. This approach requires no major code changes in each service. As long as we honor the original ESP contract, team’s should simply need to update the base URL. Furthermore, we achieve the Single Responsibility Principle. When outages happen, the configurations have to be changed only in the weighted-traffic-routing service and this is the only service to be modified. Finally, we have much more control over retry semantics as they are implemented in a single place.

Some of the drawbacks of taking this approach are the time involved in spinning up a new service from scratch - building, testing, and shipping it is not insignificant. Secondly, we are adding an additional network hop; previously the email-notification-services talked directly to the email service providers, now we are proxying it through another service. Furthermore, since all the upstream requests are funneled through this service, it has to be load-balanced and scaled according to the incoming traffic and we need to ensure we service every request.

Design 3 - The Workers way

The final design we considered; Implementing the weighted traffic routing module in a Cloudflare Worker.

There are a lot of positives to doing it this way.

Firstly, minimal code changes are required in each service. In fact, nothing beyond the environment variables need to be updated. As with the previous design, when email vendor outages happen, only the Email Weighted Traffic Router Worker configuration has to be changed. One positive of using a Cloudflare Worker is when an environment variable is changed and deployed, it’s deployed in all of Cloudflare’s network in a few milliseconds. This is a significant advantage for us because it means the failover process takes only a few milliseconds; time-critical emails will reach our customers’ inbox on time, preventing major incidents.

Secondly, As the email traffic increases, we don’t have to care about scaling or load balancing the weighted-traffic-routing module. A Worker automatically takes care of it.

Finally, the operational overhead to building and running this is incredibly low. We don’t have to spend time on spinning up new containers, configuring and deploying them. The majority of the work is just to programme the algorithm that we discussed earlier.

There are still some drawbacks here though.We are still introducing a new service and an additional network hop. However, we can make use of the templates available here to make bootstrapping as efficient as possible.

Building it the Workers way

Weighing the pros and cons of the above approaches, we went with implementing it in a Cloudflare Worker.

The Email Traffic Router Worker sits in between the Cloudflare control plane and the ESPs and proxies requests between them. We stream raw responses from the upstream; the internal service is completely agnostic of multiple ESPs, and how the traffic is routed.

We configure the percentage of traffic we want to send to each vendor using environment variables. Cloudflare Workers provide environment variables which is a convenient way to configure a worker application.

addEventListener('fetch', async (event: FetchEvent) => {
try {
    const finalInstance = ratioBasedRandPicker(
      parseFloat(VENDOR1_TO_VENDOR2_RATIO),
      API_ENDPOINT_VENDOR1,
      API_ENDPOINT_VENDOR2,
    )
    // stream request and response
    return doRequest(request, finalInstance)
  } catch (err) {
       // handle errors
    }
})

export const ratioBasedRandPicker = (
  ratio: number,
  VENDOR1: string,
  VENDOR2: string,
  generator: () => number = Math.random,
): string => {
  if (isNaN(ratio) || ratio < 0 || ratio > 1) throw new ConfigurationError(`invalid ratio ${ratio}`)
  return generator() < ratio ? VENDOR1 : VENDOR2  
}

The Email Weighted Traffic Router Worker generates a random number between 0 (inclusive) and 1 with approximately uniform distribution over the range. When the ratio is set to 1, vendor 1 will be picked always, and similarly when it’s set to 0 vendor 2 will be picked all the time. For any values in between, the vendor is picked based on the weights assigned. Thus, when all the traffic has to be routed to vendor 1 we set the ratio to 1; the random numbers generated are always less than 1, hence vendor 1 will always be selected.

Doing a failover is as simple as updating the vendor1_to_vendor2 environment variable. In general, an environment variable for a Cloudflare Worker can be updated in one of the following ways:

Using Cloudflare Dashboard: All Worker environment variables are available to be configured in the Cloudflare Dashboard. Navigate to Workers -> Settings -> Variables -> Edit Variables and update the value

Using Wrangler (if your Worker is managed by wrangler): edit the environment variable in the wrangler.toml file and execute wrangler publish for the new values to kick in.

Once the variable is updated using one of the above methods, the Email Traffic Router Worker is deployed immediately in all Cloudflare data centers with this value. Hence all our emails will now be routed by the healthy vendor.

Now, we have built and shipped our weighted-traffic-routing logic in a Worker and building it in a Worker really accelerated our process of improving the resiliency.

The module is shipped, now what happens when the ESP is suffering an outage or a degraded performance? Let me conclude with a success story.

When an ESP outage happened

After deploying this Email Traffic Router Worker to production, we saw an outage declared by one of our vendors. We immediately configured our Traffic Router to send all traffic to the healthy vendor. Changing this configuration took about a few seconds, and deploying the change was even faster. Within a few milliseconds all the traffic was routed to the healthy instance, and all the critical emails reached their corresponding inboxes on time!

Looking at some timelines:

18:45 UTC - we get alerted that message delivery is slow/degraded from email provider 1. The other provider is not impacted

18:47 UTC - Teams within Cloudflare confirmed that their message outbound is experiencing delays

18:47 UTC - we decide to failover

18:50 UTC - we configured the Email Traffic Router Worker to dispatch all the traffic to the healthy email provider, deployed the changes

Starting 18:50 we saw a steady decrease in requests to email provider 1 and a steady uptick in the requests to the provider 2 instance.

Here’s our metrics dashboard:

Next steps

Currently, the failover step is manual. Our next step is to automate this process. Once we get notified that one of the vendors is experiencing outages, the Email Traffic Router Worker will failover automatically to the healthy vendor.

Conclusion

To reiterate our goals, we wanted to make our email notification systems resilient by

adding another vendor
having a simple and fast mechanism to failover when vendor outages happen
achieving the above two objectives as quickly as possible to reduce customer impact

Thus our email systems are more reliable, and more resilient to any outages or failures of our email service providers. Cloudflare Worker makes it easy and simple to build and ship a production-ready service in no time; it also makes failovers possible without any code changes or downtime. It’s powerful that multiple services/teams within Cloudflare can rely on one place, the Email Traffic Router Worker, for uninterrupted email delivery.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

The Cloudflare Blog

Email notification services

Why did we need to improve resilience?

Step 1 - Improve Availability

Solution

Step 2 - Mitigate downtime

Design 1 - A simple solution

Design 3 - The Workers way

Building it the Workers way

When an ESP outage happened

Next steps

Conclusion