Debugging war story: the mystery of NXDOMAIN
Cloudflare Team · 2016-12-07 · via The Cloudflare Blog

The following blog post describes a debugging adventure on Cloudflare's Mesos-based cluster. This internal cluster is primarily used to process log file information so that Cloudflare customers have analytics, and for our systems that detect and respond to attacks.

The problem encountered didn't have any effect on our customers, but did have engineers scratching their heads...

The Problem

At some point, in one of our clusters, we started seeing errors like this (an NXDOMAIN for an existing domain on our internal DNS):

lookup some.existing.internal.host on 10.1.0.9:53: no such host

This seemed very weird, since the domain did indeed exist. It was one of our internal domains! Engineers had mentioned that they'd seen this behaviour, so we decided to investigate deeper. Queries triggering this error were varied and ranged from dynamic SRV records managed by mesos-dns to external domains looked up from inside the cluster.

Our first naive attempt was to run the following in a loop:

while true; do dig some.existing.internal.host > /tmp/dig.txt || break; done

Running this for a while on one server did not reproduce the problem: all the lookups were successful. Then we took our service logs for a day and did a grep for “no such host” and similar messages. Errors were happening sporadically. There were hours between errors and no obvious pattern that could lead us to any conclusion. Our investigation discarded the possibility that the error lay in Go, which we use for lots of our services, since errors were coming from Java services too.

Into the rabbit hole

At the time, our cluster DNS resolver was Unbound, running on a single IP shared across a few machines, with BGP responsible for announcing internal routes from those machines to the router. We decided to try to find a pattern by sending lots of requests from different machines and recording errors. Here’s what our load testing program looked like at first:

package main

import (
	"flag"
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	n := flag.String("n", "", "domain to look up")
	p := flag.Duration("p", time.Millisecond*10, "pause between lookups")

	flag.Parse()

	if *n == "" {
		flag.PrintDefaults()
		os.Exit(1)
	}

	for {
		_, err := net.LookupHost(*n)
		if err != nil {
			fmt.Println("error:", err)
		}

		time.Sleep(*p)
	}
}

We run net.LookupHost in a loop with small pauses and log errors; that’s it. Packaging this into a Docker container and running on Marathon was an obvious choice for us, since that is how we run other services anyway. Logs get shipped to Kafka and then to Kibana, where we can analyze them. Running this program on 65 machines doing lookups every 50ms shows the following error distribution (from high to low) across hosts:

We saw no strong correlation to racks or specific machines. Errors happened on many hosts, but not on all of them, and in different time windows they happened on different machines. Putting time on the X axis and the number of errors on the Y axis showed the following:

To see if some particular DNS recursor had gone crazy, we stopped all load generators on regular machines and started the load generation tool on the recursors themselves. There were no errors in a few hours, which suggested that Unbound was perfectly healthy.

We started to suspect that packet loss was the issue, but why would “no such host” occur? It should only happen when an NXDOMAIN error is in a DNS response, but our theory was that replies didn’t come back at all.

The Missing

To test the hypothesis that losing packets can lead to a “no such host” error, we first tried blocking outgoing traffic on port 53:

sudo iptables -A OUTPUT -p udp --dport 53 -j DROP

In this case, dig and similar tools just time out, but don’t return “no such host”:

; <<>> DiG 9.9.5-9+deb8u3-Debian <<>> cloudflare.com
;; global options: +cmd
;; connection timed out; no servers could be reached

Go is a bit smarter and tells you more about what’s going on, but doesn't return “no such host” either:

error: lookup cloudflare.com on 10.1.0.9:53: write udp 10.1.14.20:47442->10.1.0.9:53: write: operation not permitted

Since the Linux kernel tells the sender that it dropped packets, we had to point the nameserver to some black hole in the network that does nothing with packets to mimic packet loss. Still no luck:

error: lookup cloudflare.com on 10.1.2.9:53: read udp 10.1.14.20:39046->10.1.2.9:53: i/o timeout

To continue blaming the network we had to support our assumptions somehow, so we added timing information to our lookups:

s := time.Now()
_, err := net.LookupHost(*n)
e := time.Since(s).Seconds()
if err != nil {
	log.Printf("error after %.4fs: %s", e, err)
} else if e > 1 {
	log.Printf("success after %.4fs", e)
}

To be honest, we started by timing errors and added success timing later. Errors were happening after 10s, and a comparatively large number of successful responses were arriving after 5s. It does look like packet loss, but it still does not tell us why “no such host” happens.

Since we now knew which hosts were more likely to be affected, we ran the following two commands in parallel in two screen sessions:

while true; do dig cloudflare.com > /tmp/dig.log || break; done; date; sudo killall tcpdump
sudo tcpdump -w /state/wtf.cap port 53

The point was to get a network dump with failed resolves. In there, we saw the following queries:

00.00s A cloudflare.com
05.00s A cloudflare.com
10.00s A cloudflare.com.in.our.internal.domain

Two queries time out without any answer, but the third one gets lucky and succeeds. Naturally, we don’t have cloudflare.com in our internal domain, so Unbound rightfully gives NXDOMAIN in reply, 10s after the lookup was initiated.

Bingo

Let’s look at /etc/resolv.conf to understand more:

nameserver 10.1.0.9
search in.our.internal.domain

Using the search keyword allows us to use short hostnames instead of FQDN, making myhost transparently equivalent to myhost.in.our.internal.domain.

For the DNS resolver it means the following: send every DNS query to the nameserver 10.1.0.9; if that fails, append .in.our.internal.domain to the name and retry. It doesn’t matter how the original query failed. Usually the failure is an NXDOMAIN, but in our case it was a read timeout caused by packet loss.

It seemed that all of the following events had to occur for a “no such host” error to appear:

  1. The original DNS request has to be lost

  2. The retry that is sent after 5 seconds has to be lost

  3. The subsequent query for the internal domain (caused by the search option) has to succeed and return NXDOMAIN

On the other hand, to observe a timed-out DNS query instead of NXDOMAIN, you have to lose four packets in a row, sent 5 seconds apart (2 for the original query and 2 for the internal version of your domain), which is far less likely. In fact, we only saw an NXDOMAIN after 15s once and never saw an error after 20s.

To validate that assumption, we built a proof-of-concept DNS server that drops all requests for cloudflare.com, but sends an NXDOMAIN for existing domains:

package main

import (
	"github.com/miekg/dns"
	"log"
)

func main() {
	server := &dns.Server{Addr: ":53", Net: "udp"}

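	dns.HandleFunc(".", func(w dns.ResponseWriter, r *dns.Msg) {
		// Build an NXDOMAIN reply; it is sent for every name except the one dropped below.
		m := &dns.Msg{}
		m.SetRcode(r, dns.RcodeNameError)

		for _, q := range r.Question {
			log.Printf("checking %s", q.Name)
			if q.Name == "cloudflare.com." {
				// Simulate packet loss: return without writing any reply.
				log.Printf("ignoring %s", q.Name)
				return
			}
		}

		if err := w.WriteMsg(m); err != nil {
			log.Printf("error writing reply: %s", err)
		}
	})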

	log.Printf("listening..")

	if err := server.ListenAndServe(); err != nil {
		log.Fatalf("error listening: %s", err)
	}
}

Finally, we found what was going on and had a way of reliably replicating that behaviour.

Solutions

Let's think about how we can make our client more resilient to these transient network issues. The man page for resolv.conf tells you that you have two knobs: the timeout and attempts options, which set how long the resolver waits for a reply and how many times it sends a query before giving up. The default values are 5 (seconds) and 2 respectively.

Unless you keep your DNS server very busy, it is very unlikely to take more than 1 second to reply. In fact, even if you happen to have a network device on the Moon, you can expect it to reply in 3 seconds. If your nameserver lives in the next rack and is reachable over a high-speed network, you can safely assume that if there is no reply after 1 second, your DNS server did not get your query. If you want fewer weird “no such host” errors that make you scratch your head, you might as well increase retries too. The more times you retry under transient packet loss, the lower the chance of failure; the shorter the timeout between retries, the sooner you are likely to finish.

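Imagine that you have truly random 1% packet loss, so that each attempt independently goes unanswered with 1% probability. A lookup then fails only if every attempt is lost:

  • 2 retries, 5s timeout: max 10s wait before error, 0.01% chance of failure (0.01^2)

  • 5 retries, 1s timeout: max 5s wait before error, 0.00000001% chance of failure (0.01^5)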

In real life, the distribution would be different due to the fact that packet loss is not random, but you can expect to wait much less for DNS to reply with this type of change.

As you may know, many system components that provide DNS resolution, such as glibc, nscd, and systemd-resolved, are not hardened to handle being on the Internet or in an environment with packet loss. We have faced the challenge of creating a reliable and fast DNS resolution environment a number of times as we have grown, only to discover later that the solution is not perfect.

Over to you

Given what you have read in this article about packet loss and split-DNS/private-namespace, how would you design a fast and reliable resolution setup? What software would you use and why? What tuning changes from standard configuration would you use?

We'd love to hear your ideas in the comments. And if you like working on problems like this, send us your resume. We are hiring.
