How we use HashiCorp Nomad

The Cloudflare Blog

The day my ping took countermeasures Announcing Claude Compliance API support with Cloudflare CASB Announcing Claude Managed Agents on Cloudflare Project Glasswing: what Mythos showed us Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse Browser Run: now running on Cloudflare Containers, it’s faster and more scalable When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug Building For The Future How Cloudflare responded to the “Copy Fail” Linux vulnerability When DNSSEC goes wrong: how we responded to the .de TLD outage Code Orange: Fail Small is complete. The result is a stronger Cloudflare network Introducing Dynamic Workflows: durable execution that follows the tenant Post-quantum encryption for Cloudflare IPsec is generally available Agents can now create Cloudflare accounts, buy domains, and deploy Shutdowns, power outages, and conflict: a review of Q1 2026 Internet disruptions Making Rust Workers reliable: panic and abort recovery in wasm‑bindgen Moving past bots vs. humans Building the agentic cloud: everything we launched during Agents Week 2026 The AI engineering stack we built internally — on the platform we ship Orchestrating AI Code Review at scale Introducing the Agent Readiness score. Check to see if your site is agent-ready Shared Dictionaries: compression that keeps up with the agentic web Redirects for AI Training enforces canonical content Unweight: how we compressed an LLM 22% without sacrificing quality Agents that remember: introducing Agent Memory Agents Week: network performance update Introducing Flagship: feature flags built for the age of AI Cloudflare’s AI Platform: an inference layer designed for agents Building the foundation for running extra-large language models AI Search: the search primitive for your agents Deploy Postgres and MySQL databases with PlanetScale + Workers Artifacts: versioned storage that speaks Git Email for agents - Cloudflare Email Service now in public beta Project Think: building the next generation of AI agents on Cloudflare Introducing Agent Lee - a new interface to the Cloudflare stack Register domains wherever you build: Cloudflare Registrar API now in beta Browser Run: give your agents a browser Rearchitecting the Workflows control plane for the agentic era Add voice to your agent Managed OAuth for Access: make internal apps agent-ready in one click Securing non-human identities: automated revocation, OAuth, and scoped permissions Scaling MCP adoption: Our reference architecture for simpler, safer and cheaper enterprise deployments of MCP Secure private networking for everyone: users, nodes, agents, Workers — introducing Cloudflare Mesh Building a CLI for all of Cloudflare Durable Objects in Dynamic Workers: Give each AI-generated app its own database Agents have their own computers with Sandboxes GA Dynamic, identity-aware, and secure Sandbox auth Welcome to Agents Week 500 Tbps of capacity: 16 years of scaling our global network From bytecode to bytes- automated magic packet generation Cloudflare targets 2029 for full post-quantum security How we built Organizations to help enterprises manage Cloudflare at scale Why we're rethinking cache for the AI era Our ongoing commitment to privacy for the 1.1.1.1 public DNS resolver Introducing EmDash — the spiritual successor to WordPress that solves plugin security Introducing Programmable Flow Protection: custom DDoS mitigation logic for Magic Transit customers Cloudflare Client-Side Security: smarter detection, now open to everyone How we use Abstract Syntax Trees (ASTs) to turn Workflows code into visual diagrams A one-line Kubernetes fix that saved 600 hours a year Sandboxing AI agents, 100x faster Inside Gen 13- how we built our most powerful server yet Launching Cloudflare’s Gen 13 servers- trading cache for cores for 2x edge compute performance Powering the agents: Workers AI now runs large models, starting with Kimi K2.5 Introducing Custom Regions for precision data control Standing up for the open Internet- why we appealed Italy’s Piracy Shield fine From legacy architecture to Cloudflare One Announcing Cloudflare Account Abuse Protection: prevent fraudulent attacks from bots and humans Slashing agent token costs by 98% with RFC 9457-compliant error responses AI Security for Apps is now generally available Building a security overview dashboard for actionable insights Investigating multi-vector attacks in Log Explorer Translating risk insights into actionable protection: leveling up security posture with Cloudflare and Mastercard Fixing request smuggling vulnerabilities in Pingora OSS deployments Active defense: introducing a stateful vulnerability scanner for APIs Complexity is a choice. SASE migrations shouldn’t take years. From the endpoint to the prompt: a unified data security vision in Cloudflare One Ending the "silent drop": how Dynamic Path MTU Discovery makes the Cloudflare One Client more resilient A QUICker SASE client: re-building Proxy Mode How Automatic Return Routing solves IP overlap Always-on detections: eliminating the WAF “log versus block” trade-off Mind the gap: new tools for continuous enforcement from boot to login Stop reacting to breaches and start preventing them with User Risk Scoring Defeating the deepfake: stopping laptop farms and insider threats Moving from license plates to badges: the Gateway Authorization Proxy Evolving Cloudflare’s Threat Intelligence Platform: actionable, scalable, and ETL-less Introducing the 2026 Cloudflare Threat Report See risk, fix risk: introducing Remediation in Cloudflare CASB How Cloudy translates complex security into human action From reactive to proactive: closing the phishing gap with LLMs Modernizing with agile SASE: a Cloudflare One blog takeover Beyond the blank slate: how Cloudflare accelerates your Zero Trust journey The truly programmable SASE platform Toxic combinations: when small signals add up to a security incident We deserve a better streams API for JavaScript The most-seen UI on the Internet? Redesigning Turnstile and Challenge Pages ASPA: making Internet routing more secure Bringing more transparency to post-quantum usage, encrypted messaging, and routing security How we rebuilt Next.js with AI in one week Cloudflare One is the first SASE offering modern post-quantum encryption across the full platform Cloudflare outage on February 20, 2026

Cloudflare Team · 2020-06-05 · via The Cloudflare Blog

In this blog post, we will walk you through the reliability model of services running in our more than 200 edge cities worldwide. Then, we will go over how deploying a new dynamic task scheduling system, HashiCorp Nomad, helped us improve the availability of services in each of those data centers, covering how we deployed Nomad and the challenges we overcame along the way. Finally, we will show you both how we currently use Nomad and how we are planning on using it in the future.

Reliability model of services running in each data center

For this blog post, we will distinguish between two different categories of services running in each data center:

Customer-facing services: all of our stack of products that our customers use, such as caching, WAF, DDoS protection, rate-limiting, load-balancing, etc.
Management services: software required to operate the data center, that is not in the direct request path of customer traffic.

Customer-facing services

The reliability model of our customer-facing services is to run them on all machines in each data center. This works well as it allows each data center’s capacity to scale dynamically by adding more machines.

Scaling is especially made easy thanks to our dynamic load balancing system, Unimog, which runs on each machine. Its role is to continuously re-balance traffic based on current resource usage and to check the health of services. This helps provide resiliency to individual machine failures and ensures resource usage is close to identical on all machines.

As an example, here is the CPU usage over a day in one of our data centers where each time series represents one machine and the different colors represent different generations of hardware. Unimog keeps all machines processing traffic and at roughly the same CPU utilization.

Management services

Some of our larger data centers have a substantial number of machines, but sometimes we need to reliably run just a single or a few instances of a management service in each location.

There are currently a couple of options to do this, each have their own pros and cons:

Deploying the service to all machines in the data center:
- Pro: it ensures the service’s reliability
- Con: it unnecessarily uses resources which could have been used to serve customer traffic and is not cost-effective
Deploying the service to a static handful of machines in each data center:
- Pro: it is less wasteful of resources and more cost-effective
- Con: it runs the risk of service unavailability when those handful of machines unexpectedly fail

A third, more viable option, is to use dynamic task scheduling so that only the right amount of resources are used while ensuring reliability.

A need for more dynamic task scheduling

Having to pick between two suboptimal reliability model options for management services we want running in each data center was not ideal.

Indeed, some of those services, even though they are not in the request path, are required to continue operating the data center. If the machines running those services become unavailable, in some cases we have to temporarily disable the data center while recovering them. Doing so automatically re-routes users to the next available data center and doesn’t cause disruption. In fact, the entire Cloudflare network is designed to operate with data centers being disabled and brought back automatically. But it’s optimal to route end users to a data center near them so we want to minimize any data center level downtime.

This led us to realize we needed a system to ensure a certain number of instances of a service were running in each data center, regardless of which physical machine ends up running it.

Customer-facing services run on all machines in each data center and do not need to be onboarded to that new system. On the other hand, services currently running on a fixed subset of machines with sub-optimal reliability guarantees and services which don’t need to run on all machines are good candidates for onboarding.

Our pick: HashiCorp Nomad

Armed with our set of requirements, we conducted some research on candidate solutions.

While Kubernetes was another option, we decided to use HashiCorp’s Nomad for the following reasons:

Satisfies our initial requirement, which was reliably running a single instance of a binary with resource isolation in each data center.
Has few dependencies and a straightforward integration with Consul. Consul is another piece of HashiCorp software we had already deployed in each datacenter. It provides distributed key-value storage and service discovery capabilities.
Is lightweight (single Go binary), easy to deploy and provision new clusters which is a plus when deploying as many clusters as we have data centers.
Has a modular task driver (part responsible for executing tasks and providing resource isolation) architecture to support not only containers but also binaries and any custom task driver.
Is open source and written in Go. We have Go language experience within the team, and Nomad has a responsive community of maintainers on GitHub.

Deployment architecture

Nomad is split in two different pieces:

Nomad Server: instances forming the cluster responsible for scheduling, five per data center to provide sufficient failure tolerance
Nomad Client: instances executing the actual tasks, running on all machines in every data center

To guarantee Nomad Server cluster reliability, we deployed instances on machines which are part of different failure domains:

In different inter-connected physical data centers forming a single location
In different racks, connected to different switches
In different multi-node chassis (most of our edge hardware comes in the form of multi-node chassis, one chassis contains four individual servers)

We also added logic to our configuration management tool to ensure we always keep a consistent number of Nomad Server instances regardless of the expansions and decommissions of servers happening on a close to daily basis.

The logic is rather simple, as server expansions and decommissions happen, the Nomad Server role gets redistributed to a new list of machines. Our configuration management tool then ensures that Nomad Server runs on the new machines before turning it off on the old ones.

Additionally, because server expansions and decommissions affect a subset of racks at a time and the Nomad Server role assignment logic provides rack-diversity guarantees, the cluster stays healthy as quorum is kept at all times.

Job files

Nomad job files are templated and checked into a git repository. Our configuration management tool then ensures the jobs are scheduled in every data center. From there, Nomad takes over and ensures the jobs are running at all times in each data center.

By exposing rack metadata to each Nomad Client, we are able to make sure each instance of a particular service runs in a different rack and is tied to a different failure domain. This way we make sure that the failure of one rack of servers won’t impact the service health as the service is also running in a different rack, unaffected by the failure.

We achieve this with the following job file constraint:

constraint {
  attribute = "${meta.rack}"
  operator  = "distinct_property"
}

Service discovery

We leveraged Nomad integration with Consul to get Nomad jobs dynamically added to the Consul Service Catalog. This allows us to discover where a particular service is currently running in each data center by querying Consul. Additionally, with the Consul DNS Interface enabled, we can also use DNS-based lookups to target services running on Nomad.

Observability

To be able to properly operate as many Nomad clusters as we have data centers, good observability on Nomad clusters and services running on those clusters was essential.

We use Prometheus to scrape Nomad Server and Client instances running in each data center and Alertmanager to alert on key metrics. Using Prometheus metrics, we built a Grafana dashboard to provide visibility on each cluster.

We set up our Prometheus instances to discover services running on Nomad by querying the Consul Service Directory and scraping their metrics periodically using the following Prometheus configuration:

- consul_sd_configs:
  - server: localhost:8500
  job_name: management_service_via_consul
  relabel_configs:
  - action: keep
    regex: management-service
    source_labels:
    - __meta_consul_service

We then use those metrics to create Grafana dashboards and set up alerts for services running on Nomad.

To restrict access to Nomad API endpoints, we enabled mutual TLS authentication and are generating client certificates for each entity interacting with Nomad. This way, only entities with a valid client certificate can interact with Nomad API endpoints in order to schedule jobs or perform any CLI operation.

Challenges

Deploying a new component always comes with its set of challenges; here is a list of a few hurdles we have had to overcome along the way.

Initramfs rootfs and pivot_root

When starting to use the exec driver to run binaries isolated in a chroot environment, we noticed our stateless root partition running on initramfs was not supported as the task would not start and we got this error message in our logs:

Feb 12 19:49:03 machine nomad-client[258433]: 2020-02-12T19:49:03.332Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=fa202-63b-33f-924-42cbd5 task=server error="failed to launch command with executor: rpc error: code = Unknown desc = container_linux.go:346: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:109: jailing process inside rootfs caused \\\"pivot_root invalid argument\\\"\"""

We filed a GitHub issue and submitted a workaround pull request which was promptly reviewed and merged upstream.

In parallel, for maximum isolation security, we worked on enabling pivot_root in our setup by modifying our boot process and other team members developed and proposed a patch to the kernel mailing list to make it easier in the future.

Resource usage containment

One very important aspect was to make sure the resource usage of tasks running on Nomad would not disrupt other services colocated on the same machine.

Disk space is a shared resource on every machine and being able to set a quota for Nomad was a must. We achieved this by isolating the Nomad data directory to a dedicated fixed-size mount point on each machine. Limiting disk bandwidth and IOPS, however, is not currently supported out of the box by Nomad.

Nomad job files have a resources section where memory and CPU usage can be limited (memory is in MB, cpu is in MHz):

resources {
  memory = 2000
  cpu = 500
}

This uses cgroups under the hood and our testing showed that while memory limits are enforced as one would expect, the CPU limits are soft limits and not enforced as long as there is available CPU on the host machine.

Workload (un)predictability

As mentioned above, all machines currently run the same customer-facing workload. Scheduling individual jobs dynamically with Nomad to run on single machines challenges that assumption.

While our dynamic load balancing system, Unimog, balances requests based on resource usage to ensure it is close to identical on all machines, batch type jobs with spiky resource usage can pose a challenge.

We will be paying attention to this as we onboard more services and:

attempt to limit resource usage spikiness of Nomad jobs with constraints aforementioned
ensure Unimog adjusts to this batch type workload and does not end up in a positive feedback loop

What we are running on Nomad

Now Nomad has been deployed in every data center, we are able to improve the reliability of management services essential to operations by gradually onboarding them. We took a first step by onboarding our reboot and maintenance management service.

Reboot and maintenance management service

In each data center, we run a service which facilitates online unattended rolling reboots and maintenance of machines. This service used to run on a single well-known machine in each data center. This made it vulnerable to single machine failures and when down prevented machines from enabling automatically after a reboot. Therefore, it was a great first service to be onboarded to Nomad to improve its reliability.

We now have a guarantee this service is always running in each data center regardless of individual machine failures. Instead of other machines relying on a well-known address to target this service, they now query Consul DNS and dynamically figure out where the service is running to interact with it.

This is a big improvement in terms of reliability for this service, therefore many more management services are expected to follow in the upcoming months and we are very excited for this to happen.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

The Cloudflare Blog

Reliability model of services running in each data center

Customer-facing services

Management services

A need for more dynamic task scheduling

Our pick: HashiCorp Nomad

Deployment architecture

Job files

Service discovery

Observability

Challenges

Initramfs rootfs and pivot_root

Resource usage containment

Workload (un)predictability

What we are running on Nomad

Reboot and maintenance management service