The benefits of serving stale DNS entries when using Consul

The Cloudflare Blog

The day my ping took countermeasures Announcing Claude Compliance API support with Cloudflare CASB Announcing Claude Managed Agents on Cloudflare Project Glasswing: what Mythos showed us Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse Browser Run: now running on Cloudflare Containers, it’s faster and more scalable When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug Building For The Future How Cloudflare responded to the “Copy Fail” Linux vulnerability When DNSSEC goes wrong: how we responded to the .de TLD outage Code Orange: Fail Small is complete. The result is a stronger Cloudflare network Introducing Dynamic Workflows: durable execution that follows the tenant Post-quantum encryption for Cloudflare IPsec is generally available Agents can now create Cloudflare accounts, buy domains, and deploy Shutdowns, power outages, and conflict: a review of Q1 2026 Internet disruptions Making Rust Workers reliable: panic and abort recovery in wasm‑bindgen Moving past bots vs. humans Building the agentic cloud: everything we launched during Agents Week 2026 The AI engineering stack we built internally — on the platform we ship Orchestrating AI Code Review at scale Introducing the Agent Readiness score. Check to see if your site is agent-ready Shared Dictionaries: compression that keeps up with the agentic web Redirects for AI Training enforces canonical content Unweight: how we compressed an LLM 22% without sacrificing quality Agents that remember: introducing Agent Memory Agents Week: network performance update Introducing Flagship: feature flags built for the age of AI Cloudflare’s AI Platform: an inference layer designed for agents Building the foundation for running extra-large language models AI Search: the search primitive for your agents Deploy Postgres and MySQL databases with PlanetScale + Workers Artifacts: versioned storage that speaks Git Email for agents - Cloudflare Email Service now in public beta Project Think: building the next generation of AI agents on Cloudflare Introducing Agent Lee - a new interface to the Cloudflare stack Register domains wherever you build: Cloudflare Registrar API now in beta Browser Run: give your agents a browser Rearchitecting the Workflows control plane for the agentic era Add voice to your agent Managed OAuth for Access: make internal apps agent-ready in one click Securing non-human identities: automated revocation, OAuth, and scoped permissions Scaling MCP adoption: Our reference architecture for simpler, safer and cheaper enterprise deployments of MCP Secure private networking for everyone: users, nodes, agents, Workers — introducing Cloudflare Mesh Building a CLI for all of Cloudflare Durable Objects in Dynamic Workers: Give each AI-generated app its own database Agents have their own computers with Sandboxes GA Dynamic, identity-aware, and secure Sandbox auth Welcome to Agents Week 500 Tbps of capacity: 16 years of scaling our global network From bytecode to bytes- automated magic packet generation Cloudflare targets 2029 for full post-quantum security How we built Organizations to help enterprises manage Cloudflare at scale Why we're rethinking cache for the AI era Our ongoing commitment to privacy for the 1.1.1.1 public DNS resolver Introducing EmDash — the spiritual successor to WordPress that solves plugin security Introducing Programmable Flow Protection: custom DDoS mitigation logic for Magic Transit customers Cloudflare Client-Side Security: smarter detection, now open to everyone How we use Abstract Syntax Trees (ASTs) to turn Workflows code into visual diagrams A one-line Kubernetes fix that saved 600 hours a year Sandboxing AI agents, 100x faster Inside Gen 13- how we built our most powerful server yet Launching Cloudflare’s Gen 13 servers- trading cache for cores for 2x edge compute performance Powering the agents: Workers AI now runs large models, starting with Kimi K2.5 Introducing Custom Regions for precision data control Standing up for the open Internet- why we appealed Italy’s Piracy Shield fine From legacy architecture to Cloudflare One Announcing Cloudflare Account Abuse Protection: prevent fraudulent attacks from bots and humans Slashing agent token costs by 98% with RFC 9457-compliant error responses AI Security for Apps is now generally available Building a security overview dashboard for actionable insights Investigating multi-vector attacks in Log Explorer Translating risk insights into actionable protection: leveling up security posture with Cloudflare and Mastercard Fixing request smuggling vulnerabilities in Pingora OSS deployments Active defense: introducing a stateful vulnerability scanner for APIs Complexity is a choice. SASE migrations shouldn’t take years. From the endpoint to the prompt: a unified data security vision in Cloudflare One Ending the "silent drop": how Dynamic Path MTU Discovery makes the Cloudflare One Client more resilient A QUICker SASE client: re-building Proxy Mode How Automatic Return Routing solves IP overlap Always-on detections: eliminating the WAF “log versus block” trade-off Mind the gap: new tools for continuous enforcement from boot to login Stop reacting to breaches and start preventing them with User Risk Scoring Defeating the deepfake: stopping laptop farms and insider threats Moving from license plates to badges: the Gateway Authorization Proxy Evolving Cloudflare’s Threat Intelligence Platform: actionable, scalable, and ETL-less Introducing the 2026 Cloudflare Threat Report See risk, fix risk: introducing Remediation in Cloudflare CASB How Cloudy translates complex security into human action From reactive to proactive: closing the phishing gap with LLMs Modernizing with agile SASE: a Cloudflare One blog takeover Beyond the blank slate: how Cloudflare accelerates your Zero Trust journey The truly programmable SASE platform Toxic combinations: when small signals add up to a security incident We deserve a better streams API for JavaScript The most-seen UI on the Internet? Redesigning Turnstile and Challenge Pages ASPA: making Internet routing more secure Bringing more transparency to post-quantum usage, encrypted messaging, and routing security How we rebuilt Next.js with AI in one week Cloudflare One is the first SASE offering modern post-quantum encryption across the full platform Cloudflare outage on February 20, 2026

Cloudflare Team · 2021-03-08 · via The Cloudflare Blog

Introduction

We use Consul for service discovery, and we’ve deployed a cluster that spans several of our data centers. This cluster exposes HTTP and DNS interfaces so that clients can query the Consul catalog and search for a particular service and the majority of the clients use DNS. We were aware from the start that the DNS query latencies were not great from certain parts of the world that were furthest away from these data centers. This, together with the fact that we use DNS over TLS, results in some long latencies. The TTL of these names being low makes it even more impractical when resolving these names in the hot path.

The usual way to solve these issues is by caching values so that at least subsequent requests are resolved quickly, and this is exactly what our resolver of choice, Unbound, is configured to do. The problem remains when the cache expires. When it expires, the next client will have to wait while Unbound resolves the name using the network. To have a low recovery time in case some service needs to failover and clients need to use another address we use a small TTL (30 seconds) in our records and as such cache expirations happen often.

There are at least two strategies to deal with this: prefetching and stale cache.

Prefetching

When prefetching is turned on, the server tries to refresh DNS records in the background before they expire. In practice, the way this works is: if an entry is served from cache and the TTL is less than 10% of the lifetime of the records, the server responds to the client, but in the background, it dispatches a job to refresh the record across the network. If it all goes as planned, the entry is refreshed in the cache before the TTL expires, and no client is ever left waiting for the network resolution.

This works very well for records that are accessed frequently, but poses issues in two situations: records that are accessed infrequently and records that have a small TTL. In both cases it is unlikely that a prefetch would be triggered, and so the cache expires and the next client will have to wait for the network resolution.

In our case, the records have a 30 second TTL, so frequently no requests come in during the yellow "prefetch" stage, which bumps the records out of the cache and causes the next client to have to wait for the slow network resolution. Prefetching was already turned on in our Unbound configuration but this still was not enough for our use case.

Stale cache

Another option is to serve stale records from the cache while we dispatch a job to refresh the record in the background without keeping the client waiting for the response. This would provide benefits in two different scenarios:

General high latency - even after the TTL expires subsequent attempts will be returned from cache if the response takes too long to arrive.
Consul/network blips - if for some reason Consul is unavailable/unreachable, there is still a period of time that expired responses can be used instead of failing resolution.

The first case is exactly our current scenario.

Being able to serve stale entries from the cache for a period after the TTL expires affords us some time in which we can return a result to a client either immediately or after some small amount of time has passed. The latter is the behaviour that we are looking for which also makes us compliant with RFC 8767, “Serving Stale Data to Improve DNS Resiliency”, published March 2020.

Previous dealings with Unbound's stale cache

Stale cache had been tried before at Cloudflare in 2017, and at that time, we encountered two problems:

1. Unbound could serve stale records from cache indefinitelyWe could solve this through configuration that did not exist in 2017, but we had to test it.

2. Unbound serves negative/incomplete information from the expired cacheIt was reported that Unbound would serve incomplete chains of responses to clients (for example, a CNAME record with no corresponding A/AAAA record).

This turned out to be unconfirmed but it is a failure mode that nevertheless we wanted to make sure Unbound could handle since it would be possible for parts of the response chain to be purged from the stale cache before others.

Test Configuration

As mentioned before we would need to test a current version of Unbound, come up with an up-to-date stale cache configuration, and test the scenarios that posed problems when we last tried to use it plus any others that we came up with. The Unbound version used for tests was 1.13.0 and the stale cache was configured thus:

serve-expired: yes - enable the stale cache.
serve-expired-ttl: 3600 - this limits the serving of expired resources to one hour after they expire and no response from upstream was obtained.
serve-expired-client-timeout: 500 - serve from stale cache only when upstream takes longer than 500ms to respond.

The option serve-expired-ttl-reset was considered, which would cause the records to never expire, even with the upstream down indefinitely as long as there are regular requests for them, but it was deemed too dangerous since it could potentially last forever and applies to all entries.

Example consul zone: service1.sd.

While resolving this name we end up with:

;; ANSWER SECTION:
service1.sd. 120 IN CNAME service1.query.consul.
service1.query.consul. 30 IN A 192.0.2.1
 
;; Query time: 230 msec

If we dump Unbound's cache (unbound-control dump_cache), we can find the entries there:

service1.query.consul. 23  IN  A   192.0.2.1
service1.sd.   113 IN  CNAME   service1.query.consul.

If we wait for the TTL of these entries to expire and dump the cache again, we notice that they don't show up anymore:

➜ unbound-control dump_cache | grep -e query.consul -e service1
➜

But if we block access to the Consul resolver, we can see that still the entries are returned:

;; ANSWER SECTION:
service1.sd. 119 IN CNAME service1.query.consul.
service1.query.consul. 30 IN A 192.0.2.1
 
;; Query time: 503 msec

Notice that Unbound waited 500ms before returning a response, which was the value we configured. This also shows that the dump cache method does not return expired entries even though they are present. So far so good.

Test - Unbound could serve stale records from cache indefinitely

After 3600 seconds pass we again try to resolve the same name:

➜ dig  service1.sd. @127.0.0.1
 
; <<>> DiG 9.10.6 <<>> service1.sd. @127.0.0.1
;; global options: +cmd
;; connection timed out; no servers could be reached

Dig gave up on waiting for a response from Unbound. So we see that after the configured time, Unbound will indeed stop serving the stale record, even if it has been requested recently. This is due to serve-expired-ttl: 3600 and serve-expired-ttl-reset being set to "no" by default. These options were not available when we first tried Unbound’s stale cache.

Test - Unbound serves negative/incomplete information from the expired cache

Since the record we are resolving includes a CNAME and an A record with different TTLs, we will try to see if Unbound can be made to return an incomplete chain (CNAME with no A record).

We start by having both entries in cache, and block connections to the Consul resolver. We then remove from the cache the A entry:

$ /usr/sbin/unbound-control flush_type service1.query.consul. A
ok

Now we attempt to query:

$ dig  service1.sd. @127.0.0.1
...
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 9532
;; Query time: 22 msec

Nothing was returned which is the result we were expecting.

Another scenario we wanted to test was to see if when serve-expired-ttl + A's TTL had passed, the CNAME would be returned on its own.

➜ dig  service1.sd. @127.0.0.1
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48918
;; ANSWER SECTION:
service1.sd. 48 IN CNAME service1.query.consul.
service1.query.consul. 30 IN A 192.0.2.1
;; Query time: 430 msec
;; WHEN: Thu Dec 31 17:07:55 WET 2020
 
➜ dig  service1.sd. @127.0.0.1
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 42432
;; Query time: 509 msec
;; WHEN: Thu Dec 31 17:07:56 WET 2020

We lowered serve-expired-ttl to 40 to help testing the issue and we were able to see that as soon as the A entry was deemed not serve-able, the CNAME was also not returned, which is the behaviour that we want.

One last thing that we wanted to test was if the cache was really refreshed in the background while the stale entry was returned. Network Link Conditioner was used to simulate high latency:

➜ dig service1.service.consul @CONSUL_ADDRESS
 
;; ANSWER SECTION:
service1.service.consul. 60 IN A 192.0.2.10
service1.service.consul. 60 IN A 192.0.2.12
 
;; Query time: 3053 msec

We can see that the artificial latency is working. We can now test if changes to the address are actually reflected on the consul record even though a stale record is served at first:

➜ dig service2.service.consul @127.0.0.1
...
;; ANSWER SECTION:
service2.service.consul. 30 IN A 192.0.2.100
 
;; Query time: 501 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Sat Jan 02 13:00:36 WET 2021
;; MSG SIZE  rcvd: 89
 
➜ dig service2.service.consul @127.0.0.1
...
;; ANSWER SECTION:
service2.service.consul. 58 IN A 192.0.2.200
 
;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Sat Jan 02 13:00:39 WET 2021
;; MSG SIZE  rcvd: 89

We can see that the first attempt was served from the stale cache, since the address was already changed on Consul to 192.0.2.200, and the query time was 500ms. The next attempt was served from the fresh cache, hence the 0ms response time, and so we know that the cache was refreshed in the background.

Test notes

All the test scenarios completed successfully, even though we did find an issue that has since been fixed upstream. We realised that the prefetching logic blocked the cache while it attempted to refresh the entry, which prevented stale entries from being served if the upstream was unresponsive. This fix is included in version 1.13.1.

Deployment status

This change has already been deployed to all of our data centers, and while the change addressed our original use case, there were some interesting side-effects as can be seen from the graphs below.

Decreased in-flight requests

The graph above is from the Toronto data center which received the change at approximately 14:00 UTC. We can observe an attenuated curve, which indicates that spikes in the number of in-flight requests were probably caused by slow/unresponsive upstreams which are now served a stale response after a short while.

The same graph from the Perth data center, which received the change on a later day.

Decrease in P99 latency

The graph above is from the Lisbon data center_,_ which received the change at approximately 12:37 UTC. We can see that the long tail (P99) of request latencies decreases dramatically as we are now able to serve a response to clients in more situations without having them wait for a slow upstream network resolution that might not even succeed.

The effect is even more dramatic on the Paris data center.

The decrease in P99 latencies directly correlates to the start of serving stale entries, even though there are not many of them. In the graph above we can observe the number of stale records served per second for the same data center, using the last 3 minutes of datapoints.

Decreased general error rate

This graph is from our Perth data center, which received the change on February 8. This is explained by now being able to serve some requests to unresponsive/slow upstreams, which previously would have resulted in a failed request.

Graph from the Vientiane data center, which received the change on February 1.

Less jitter on the rate of load balancer queries by RRDNS

Another graph from the Perth data center, which received the change on February 8. An explanation for this change is that resolvers usually retry after a small amount of time when they don't get any response, and so by returning a stale response instead of keeping the client waiting (and retrying), we prevent a surge of new queries for expired and slow to resolve records.

A graph from the Doha data center, which received the change on February 1.

Miscellaneous Unbound metrics

This change also had a positive effect on several Unbound-specific metrics, below are several graphs from the Perth data center that received the change on February 8.

How it started vs how it's going

Further improvements

In the future, we would like to tweak the serve-expired-client-timeout value, which is currently set to 500ms. It was chosen conservatively, at a time when we were cautiously experimenting with a feature that had been problematic in the past, and our gut feeling is that it can be at least halved without causing issues to further improve the client’s experience. There is also the possibility of using different values per data center, for example picking up the P90 resolve latency value which differs per datacenter, and this is something that we want to explore further.

Summary

We are enabling Unbound to return expired entries from the cache whenever an upstream takes longer than 500ms to respond, while refreshing the cache in the background, unless 3600 seconds have elapsed from the record's original TTL.

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

The Cloudflare Blog

Introduction

Prefetching

Stale cache

Previous dealings with Unbound's stale cache

Test Configuration

Test - Unbound could serve stale records from cache indefinitely

Test - Unbound serves negative/incomplete information from the expired cache

Test notes

Deployment status

Decreased in-flight requests

Decrease in P99 latency

Decreased general error rate

Less jitter on the rate of load balancer queries by RRDNS

Miscellaneous Unbound metrics

Further improvements

Summary