The Hidden Linux Routing Issue That Broke My Deployment

The deployment should have taken a few minutes.

The application was running, DNS was configured correctly, and the domain was already pointing to the server's public IP. Caddy was configured as a reverse proxy and was listening on ports 80 and 443. Every item on my deployment checklist appeared healthy.

Yet every Let's Encrypt validation attempt kept failing.

The error looked simple enough:

authorization failed
timeout during connect
likely firewall problem

At first, I believed it.

I checked DNS resolution, verified firewall rules, confirmed that Caddy was listening on the expected ports, and made sure the application itself was reachable. Every check came back clean.

That was the first clue that the problem might not be where the logs were pointing.

The Obvious Things

The first assumption was DNS.

I verified that the domain resolved to the correct public IP.

dig +short my-domain.com

Everything looked correct.

Next came the firewall.

sudo ufw status

Ports 80 and 443 were open. There were no unexpected deny rules, and nothing suggested inbound traffic was being blocked.

Then I checked whether Caddy was actually listening.

sudo ss -tulpn | grep -E ':80|:443'

Again, everything looked normal.

The application itself was healthy too.

curl http://localhost:3001

returned a valid response.

At this point I had checked most of the things engineers typically check when certificate validation fails. DNS looked good, the firewall looked good, the reverse proxy was healthy, and the application was running.

Yet the validation errors continued.

The Part That Sent Me In The Wrong Direction

The error messages kept mentioning connectivity problems and possible firewall issues.

That wording influenced my thinking more than it should have.

I spent time investigating firewall rules, reverse proxy configuration, TLS settings, and domain configuration. Every new hypothesis felt reasonable, but none of them explained why local tests consistently succeeded while external validation continued to fail.

The contradiction kept bothering me.

If the service was truly unreachable, why did everything work from inside the server?

Then I Hit The Rate Limit

This was the point where I realized I was no longer troubleshooting.

I was guessing.

After several failed validation attempts, Let's Encrypt stopped accepting new authorization requests and returned a rate-limit error.

too many failed authorizations

I had burned through multiple validation attempts without actually understanding the root cause.

Looking back, this was probably the most useful lesson from the entire incident.

Repeatedly retrying a failing system is not the same thing as debugging it.

Looking At The Network Instead Of The Logs

At this point I stopped changing configurations and started gathering evidence.

The first useful clue came from tcpdump.

sudo tcpdump -ni ens3 tcp port 80

While monitoring traffic, I triggered requests from outside the server.

The packet capture immediately showed incoming connection attempts reaching the machine.

That was important.

It meant DNS was working.

It meant external traffic was reaching the public interface.

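It meant nothing upstream, such as a provider-level firewall, was silently dropping inbound requests. (tcpdump sees packets before the host firewall processes them, so the capture alone did not clear ufw, but I had already verified those rules.)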

The requests were arriving exactly where they were supposed to.

So why was validation timing out?
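
One way to take a capture like this further is to compare what arrives with what leaves. The helper below is a minimal sketch under assumptions: the name count_handshake is hypothetical, and it relies only on tcpdump's text output, where a SYN is printed as "Flags [S]" and a SYN-ACK as "Flags [S.]".

```shell
# count_handshake: read tcpdump text output on stdin and report how many
# SYN packets and SYN-ACK replies it contains.
count_handshake() {
  awk '
    /Flags \[S\],/   { syn++ }      # connection attempts
    /Flags \[S\.\],/ { synack++ }   # replies to those attempts
    END { printf "syn=%d synack=%d\n", syn, synack }
  '
}

# Example (needs root and a real interface, so it is left commented out):
#   sudo tcpdump -ni ens3 -c 20 'tcp port 80' | count_handshake
```

If the public interface shows repeated SYNs and no SYN-ACKs, the replies are either leaving through another interface or not being sent at all, and running the same capture on the second interface can help tell those cases apart.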

The Routing Table Finally Revealed The Problem

The next step was checking the routing table.

ip route

The output looked roughly like this:

default via 10.2.0.1 dev ens4 metric 100
default via 51.x.x.x dev ens3 metric 100

The server had two network interfaces.

ens3 connected to the public network
ens4 connected to a private network

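Initially, I didn't think much of it. Multi-interface servers are fairly common. What I overlooked was that both default routes carried the same metric, so nothing in the table marked the public path as the preferred one.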

Then I started checking where outbound traffic was actually leaving.

ip route get 8.8.8.8

The result surprised me.

8.8.8.8 via 10.2.0.1 dev ens4

I tested several additional destinations.

ip route get 1.1.1.1
ip route get 8.8.4.4
ip route get <validator-ip>

Every single lookup showed outbound traffic leaving through the private interface.
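
The same check can be scripted so it is easy to repeat after every routing change. This is a minimal sketch: the function names egress_dev and check_egress are hypothetical, and the interface and destinations in the usage comment are simply the ones used in this article.

```shell
# egress_dev: read `ip route get` output on stdin and print the interface
# name that follows the "dev" keyword on the first matching line.
egress_dev() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# check_egress: compare the egress interface chosen for each destination
# with the interface you expect internet-bound traffic to use.
#   check_egress ens3 8.8.8.8 1.1.1.1
check_egress() {
  expected="$1"; shift
  for dest in "$@"; do
    dev=$(ip route get "$dest" | egress_dev)
    if [ "$dev" = "$expected" ]; then
      echo "ok   $dest -> $dev"
    else
      echo "FAIL $dest -> $dev (expected $expected)"
    fi
  done
}
```

On a server in the state described here, every destination would be reported as FAIL, because the kernel picks ens4 for all of them.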

That was the breakthrough.

Understanding What Was Actually Happening

A Quick Note About Asymmetric Routing

The issue I was dealing with has a name: asymmetric routing.

Traffic was entering the server through the public interface (ens3), but Linux was attempting to send replies through the private interface (ens4).

From the application's perspective everything looked healthy.

From Let's Encrypt's perspective the connection never completed successfully.

Why This Can Cause Timeouts

While investigating the issue, I came across Linux's Reverse Path Filtering (rp_filter).

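With strict reverse path filtering enabled (rp_filter=1), the kernel checks the source address of every incoming packet against the routing table. If the best route back to that source does not use the interface the packet arrived on, the kernel treats the packet as suspicious and drops it.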

Whether the packet was being dropped by rp_filter, upstream networking, or another layer wasn't something I conclusively proved.

But understanding this interaction finally explained why inbound requests were visible while validation attempts still timed out.
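
Checking which mode is in effect takes only a moment. The kernel keeps one rp_filter setting under net.ipv4.conf.all and one per interface, and it applies the higher of the two values (0 is off, 1 is strict, 2 is loose). The helper below is a sketch of that rule; the name rp_filter_mode is hypothetical.

```shell
# rp_filter_mode: print the effective reverse path filtering mode for an
# interface, given the "all" value and the per-interface value.
# The kernel uses the higher of the two.
rp_filter_mode() {
  all="$1"; iface="$2"
  if [ "$all" -gt "$iface" ]; then eff="$all"; else eff="$iface"; fi
  case "$eff" in
    0) echo "off" ;;
    1) echo "strict" ;;
    2) echo "loose" ;;
    *) echo "unknown" ;;
  esac
}

# Example on a live system (ens3 is the public interface from this article):
#   rp_filter_mode "$(sysctl -n net.ipv4.conf.all.rp_filter)" \
#                  "$(sysctl -n net.ipv4.conf.ens3.rp_filter)"
```

A result of "strict" on the public interface would make rp_filter a plausible suspect, though it would not prove that it was the layer dropping the traffic.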

Let's Encrypt validators were connecting to my public IP.

Those packets arrived through the public interface.

Let's Encrypt
      |
      v
Public Interface (ens3)
      |
      v
    Server

So far, everything was fine.

The problem appeared when Linux generated a response.

Instead of sending the response back through the same public interface, the routing table was selecting the private interface as the preferred outbound path.

Let's Encrypt
      |
      v
Public Interface (ens3)
      |
      v
    Server
      |
      v
Private Interface (ens4)

This is the asymmetric routing described above: traffic enters through one interface and the reply tries to leave through another.

A reply pushed into the private network may never find its way back to the validator, so from the remote side the TCP handshake simply never completes.

The result is timeouts.

Exactly what Let's Encrypt was reporting.

Why This Was So Difficult To Find

The issue hid behind several misleading signals.

The application was healthy.

The reverse proxy was healthy.

DNS was correct.

Ports were open.

The firewall was configured properly.

Every layer looked healthy when viewed independently.

The actual failure existed underneath all of them.

Most deployment troubleshooting guides focus on application configuration, reverse proxies, certificates, and firewall rules. Very few immediately point you toward route selection.

Especially when the server appears to be functioning normally.

The Fix

Once the routing issue was identified, the fix itself was straightforward.

The server needed to use the public interface for internet-bound traffic instead of attempting to route those responses through the private network.
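
The exact change depends on how the server's network is managed, so what follows is a sketch of two common approaches rather than a record of the commands used here. The function names are hypothetical, and every address, metric, and table number is a placeholder you would supply. Both helpers only print the ip commands, so they can be reviewed first; changing default routes over SSH can cut off the session.

```shell
# Option 1: prefer the public default route by giving it a lower metric
# than the private one.
# Arguments: public gateway, public interface, metric.
print_metric_fix() {
  echo "ip route replace default via $1 dev $2 metric $3"
}

# Option 2: source-based policy routing, so that replies sent from the
# public address always use the public gateway.
# Arguments: public IP, public gateway, public interface, table number.
print_policy_fix() {
  echo "ip rule add from $1 table $4"
  echo "ip route add default via $2 dev $3 table $4"
}

# Example (documentation addresses, not real ones):
#   print_metric_fix 198.51.100.1 ens3 50
#   print_policy_fix 198.51.100.10 198.51.100.1 ens3 100
```

Commands applied with ip do not survive a reboot. A lasting fix has to go into whatever manages the interfaces on the machine, for example its netplan or systemd-networkd configuration.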

After correcting the routing configuration, I verified the result.

ip route get 8.8.8.8

The output now showed traffic leaving through the public interface.

Exactly what I wanted.

I restarted Caddy and triggered another validation attempt.

This time the validators connected successfully, the challenge completed, and the certificate was issued within seconds.

Hours of troubleshooting ultimately came down to a routing decision that Linux was making automatically.

Lessons Learned

A few takeaways from this incident stood out.

Error messages often describe symptoms, not causes

The logs repeatedly suggested firewall issues.

The firewall was never the problem.

Stop retrying and start investigating

I hit Let's Encrypt's authorization limits because I kept retrying before understanding the failure.

That was entirely avoidable.

Packet captures reveal reality

When logs become confusing, tcpdump often provides a much clearer picture of what is actually happening on the network.

Multi-interface servers deserve extra scrutiny

If a server has both public and private interfaces, route selection should be one of the first things you verify.

Two commands can save hours

If you're debugging unexplained connectivity issues, run these early:

ip route

ip route get 8.8.8.8

Those two commands exposed the real problem faster than everything else I tried.

Final Thoughts

I started this investigation convinced I had a TLS problem.

Then I thought it was DNS.

Then I suspected the firewall.

Then I questioned my reverse proxy configuration.

In the end, none of those were responsible.

The real issue was a routing decision made at the operating system level: the kernel was choosing the wrong path for replies, so no validation request ever got far enough to reach my application.

And like most memorable debugging sessions, the hardest part wasn't fixing the problem.

It was figuring out where the problem actually lived.
