






















I’d been wrestling with the problem of accessing AWS EKS from the office for a long time – finally lost my patience and figured it out 🙂
Here’s the problem: there’s an AWS EKS cluster with both Public and Private endpoints for the API.
Working from my office laptop, sometimes requests to it go through fine – and sometimes they die with an “i/o timeout” error:
$ kk get pod
[...] Get \"https://F07***D78.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": dial tcp 10.0.64.9:443: i/o timeout" ...
Let’s go digging – because there are nuances here both with DNS and with network routes.
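In my case EKS has both Public and Private endpoints enabled – so DNS resolution is split-horizon: public resolvers return the endpoint’s public IPs, while AWS VPC DNS returns its private IPs.
Next: I have two active VPN connections + the office WiFi, and the problem starts when I add AWS VPC DNS to resolv.conf, because the configured servers then give different answers for the same name.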
In resolv.conf it looks like this:
nameserver 1.1.1.1      # CloudFlare DNS, returns EKS Endpoint Public IP
nameserver 10.100.0.1   # my MikroTik with WireGuard, returns EKS Endpoint Public IP
nameserver 10.0.0.2     # AWS VPC DNS via OpenVPN, returns EKS Endpoint Private IP 10.0.64.9
The file is managed by openresolv, which WireGuard launches when the tunnel starts – WireGuard sets its own DNS:
$ sudo cat /etc/wireguard/wg0.conf
...
DNS = 10.100.0.1, 10.0.0.2, 192.168.0.1
...
In the timeout error we can see that the request to F07***D78.gr7.us-east-1.eks.amazonaws.com goes to IP 10.0.64.9 – meaning DNS resolution went through OpenVPN and AWS VPC DNS 10.0.0.2.
Let’s check who’s actually responsible for DNS in the system – grep the /etc/nsswitch.conf file:
$ grep hosts /etc/nsswitch.conf
hosts: mymachines resolve [!UNAVAIL=return] files myhostname dns
Here the resolve option means the nss-resolve module, which hands lookups over to systemd-resolved.
And it comes before files (nss-files and /etc/hosts) and dns (the nss-dns module and the “classic” glibc DNS resolver) – so requests go to systemd-resolved first, and [!UNAVAIL=return] means the later sources are only consulted when systemd-resolved is unavailable.
See Domain name resolution on the Arch Wiki.
Now the interesting part – exactly how systemd-resolved performs DNS resolution.
systemd-resolved takes its DNS servers from the resolv.conf that openresolv generates (that’s the “resolv.conf mode: foreign” line) – let’s look at its parameters:
$ resolvectl status
Global
Protocols: +LLMNR +mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: foreign
Current DNS Server: 10.0.0.2
DNS Servers: 10.0.0.2 10.100.0.1 192.168.0.1
...
Next, let’s check what’s happening in the system – enable the debug log for systemd-resolved:
$ sudo resolvectl log-level debug
Then in one window we open the logs:
$ sudo journalctl -u systemd-resolved -f | grep "F07***D78.gr7.us-east-1.eks.amazonaws.com"
Run kubectl get pod – and in the logs we see:
...
May 20 12:06:39 setevoy-office systemd-resolved[698]: varlink-28-28: Received message: {"method":"io.systemd.Resolve.ResolveHostname","parameters":{"name":"F07***D78.gr7.us-east-1.eks.amazonaws.com","flags":0,"ifindex":0}}
...
May 20 12:06:39 setevoy-office systemd-resolved[698]: varlink-28-28: Sending message: {"parameters":{"addresses":[{"ifindex":6,"family":2,"address":[10,0,64,9]},{"ifindex":6,"family":2,"address":[10,0,65,205]}],"name":"F07***D78.gr7.us-east-1.eks.amazonaws.com","flags":1048577}}
Here:
"ifindex":0 – “don’t care where to look”
"ifindex":6 – that’s tun0, OpenVPN and AWS VPC DNS
Let’s check the interfaces:
$ ip -o link | awk -F': ' '{print $1, $2}'
1 lo
2 enp0s31f6
4 wlan0
5 enp0s13f0u3u4u4
6 tun0
...
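The lookup from ifindex to interface name, and the decoding of the "address" arrays from the varlink messages, can also be scripted. This is only a sketch: the function names are made up for illustration, and the SYSFS_NET override is an assumption added so the lookup can be tried without real interfaces.

```shell
#!/bin/sh
# Map an ifindex from the systemd-resolved log to an interface name.
# SYSFS_NET is a hypothetical override; by default the kernel's sysfs tree is read.
ifname_by_index() (
    want=$1
    for f in "${SYSFS_NET:-/sys/class/net}"/*/ifindex; do
        [ -r "$f" ] || continue
        if [ "$(cat "$f")" = "$want" ]; then
            # the directory name is the interface name
            basename "$(dirname "$f")"
            exit 0
        fi
    done
    exit 1
)

# Convert a varlink "address" array such as [10,0,64,9] to dotted notation.
addr_to_ip() {
    printf '%s\n' "$1" | sed -e 's/[][ ]//g' -e 's/,/./g'
}

addr_to_ip '[10,0,64,9]'
```

On the laptop from this post, `ifname_by_index 6` would be expected to print tun0.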
"ifindex":6 is the tun0 interface, the work OpenVPN, and the result returned from AWS VPC DNS – "address":[10,0,64,9], because AWS VPC DNS returns a private address.
We repeat the request – and now the result is different:
...
varlink-28-28: Sending message: {"parameters":{"addresses":[{"ifindex":4,"family":2,"address":[44,216,7,46]},{"ifindex":4,"family":2,"address":[3,***,***,161]}],"name":"F07***D78.gr7.us-east-1.eks.amazonaws.com","flags":8388609}}
...
This time the response is from "ifindex":4 – wlan0, and we get a public IP.
Why – because in the same log we see:
...
Firing regular transaction 49587 ... IN A> scope dns on */*
Firing regular transaction 59798 ... IN A> scope dns on wlan0/*
...
Here the first entry is a request through the global pool, to all the servers in it:
$ resolvectl status
Global
Protocols: +LLMNR +mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: foreign
Current DNS Server: 10.0.0.2
DNS Servers: 10.0.0.2 10.100.0.1 192.168.0.1
...
And the result comes back from whoever answers first, see systemd-resolved.service:
If lookups are routed to multiple interfaces, the first successful response is returned
And in this case it was wlan0:
... Added positive ... cache entry ... on wlan0/INET/10.0.0.1 ...
Since the request went through wlan0 and the office DNS at 10.0.0.1 – the EKS endpoint resolved to a public IP.
While on the first attempt it was tun0:
... Added positive ... cache entry ... on tun0/INET/10.0.0.2 ...
And in response we got the private IP 10.0.64.9.
So:
systemd-resolved queries all available DNS servers
wlan0 (the office network) answers first – we get a public IP, and the connection goes through
tun0 (OpenVPN and AWS VPC DNS) answers first – we get a private IP, and the connection fails with a timeout
Don’t forget to set the log level back to info:
$ sudo resolvectl log-level info
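The same split can be seen without debug logs by asking each resolver directly. A minimal sketch, assuming dig is installed; compare_resolvers is a made-up helper name, and the hostname in the usage line is a placeholder rather than the real endpoint.

```shell
#!/bin/sh
# Ask each resolver for the A records of one name and print what it returns.
# Usage: compare_resolvers NAME SERVER...
compare_resolvers() {
    name=$1
    shift
    for ns in "$@"; do
        # +short prints only the answers; the short timeout keeps a dead VPN
        # from stalling the loop
        answer=$(dig +short +time=2 +tries=1 "@$ns" "$name" A | paste -s -d ' ' -)
        printf '%-12s -> %s\n' "$ns" "${answer:-no answer}"
    done
}

# Example (placeholder hostname, resolvers taken from the resolv.conf above):
# compare_resolvers EXAMPLE.gr7.us-east-1.eks.amazonaws.com 1.1.1.1 10.100.0.1 10.0.0.2
```

If the resolvers disagree, the line for 10.0.0.2 should show the private addresses while the others show public ones.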
Now let’s move on to routing – why exactly does the connection fail with a timeout error?
Let’s look at the routes on the work laptop:
$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.0.1        0.0.0.0         UG    600    0        0 wlan0
10.0.0.0        0.0.0.0         255.255.255.0   U     600    0        0 wlan0
10.0.0.2        172.16.0.1      255.255.255.255 UGH   0      0        0 tun0
10.0.6.162      172.16.0.1      255.255.255.255 UGH   0      0        0 tun0
10.0.32.0       172.16.0.1      255.255.240.0   UG    0      0        0 tun0
10.0.48.0       172.16.0.1      255.255.240.0   UG    0      0        0 tun0
10.0.66.0       172.16.0.1      255.255.255.0   UG    0      0        0 tun0
10.0.67.0       172.16.0.1      255.255.255.0   UG    0      0        0 tun0
10.100.0.0      0.0.0.0         255.255.255.0   U     0      0        0 wg0
...
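Here the default route and the office network 10.0.0.0/24 go through wlan0, only specific VPC hosts and subnets go through tun0 (10.0.0.2, 10.0.6.162, 10.0.32.0/20, 10.0.48.0/20, 10.0.66.0/24 and 10.0.67.0/24), and 10.100.0.0/24 goes through wg0.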
And now – here’s where the problem shows up: when the EKS endpoint resolves through AWS VPC DNS 10.0.0.2 – we get the private address 10.0.64.9.
But there’s no dedicated route for it through OpenVPN – so it gets routed through 10.0.0.1, the office router and the public internet:
$ kk get pod
[...] Get \"https://F07***D78.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": dial tcp 10.0.64.9:443: i/o timeout" ...
We check the route itself – and we see it goes via 10.0.0.1, the office router, instead of tun0 and OpenVPN:
$ ip route get 10.0.64.9
10.0.64.9 via 10.0.0.1 dev wlan0 src 10.0.0.133 uid 1000
cache
And of course traceroute:
$ traceroute 10.0.64.9
traceroute to 10.0.64.9 (10.0.64.9), 30 hops max, 60 byte packets
 1  office.example.dev (10.0.0.1)  14.261 ms  14.614 ms  14.592 ms
 2  * * *
 3  * * *
...
And obviously, a request to an address in a private subnet sent through the public internet just dies.
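The gap in the routing table can be confirmed by hand: none of the subnets pushed through tun0 contains 10.0.64.9. Below is a small sketch of that check in plain shell arithmetic; ip_to_int and in_cidr are illustrative helper names, and the subnet list is copied from the route output above with the netmasks rewritten as prefix lengths.

```shell
#!/bin/sh
# Convert a dotted IPv4 address to a 32-bit integer.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed if IP (first argument) falls inside NETWORK/PREFIX (second argument).
in_cidr() (
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
)

# Subnets routed through tun0 in the table above
for route in 10.0.32.0/20 10.0.48.0/20 10.0.66.0/24 10.0.67.0/24; do
    if in_cidr 10.0.64.9 "$route"; then
        echo "$route covers 10.0.64.9"
    else
        echo "$route does not cover 10.0.64.9"
    fi
done
```

10.0.48.0/20 ends at 10.0.63.255, so the endpoint address sits just past it and falls through to the default route.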
There are a few options – either just add 10.0.64.9 to OpenVPN, or set up split-DNS – and resolve the domains correctly:
add a route for 10.0.64.9 through tun0 on top of the existing ones
set up split-DNS in systemd-resolved
run a local resolver such as Unbound or dnsmasq – and switch all DNS queries over to it
The option with 10.0.64.9 on OpenVPN is a hack.
Note: only after I’d written the whole post did I remember that the EKS Control Plane lives in its own VPC Subnets, and I could’ve just added those to OpenVPN the same way it’s done for RDS, but whatever – it turned out interesting anyway 🙂
The split-DNS solution through systemd-resolved looks kind of painful.
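For comparison, here is a rough sketch of what the systemd-resolved variant might involve, using per-link DNS servers and routing-only domains. It has not been tried against this setup and rests on assumptions: the run and split_dns helpers are made up, the zone names are the ones used later in this post, the settings are runtime-only as far as I know, and 10.0.0.2 would probably still have to be removed from the global server list for the EKS name to stop reaching it.

```shell
#!/bin/sh
# Dry-run by default: print each command; set APPLY=1 (as root) to execute it.
run() {
    if [ "${APPLY:-0}" = 1 ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Usage: split_dns LINK DNS_SERVER ZONE...
split_dns() {
    link=$1
    server=$2
    shift 2
    # Use this DNS server for lookups routed to the link
    run resolvectl dns "$link" "$server"
    # Rewrite the argument list in place: a leading "~" marks each zone as a
    # routing-only domain, so only names under it are sent to this link
    for zone in "$@"; do
        shift
        set -- "$@" "~$zone"
    done
    run resolvectl domain "$link" "$@"
}

split_dns tun0 10.0.0.2 compute.internal ops.example.com
```

Run as is, the script only prints the two resolvectl commands it would execute.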
And I’d already run Unbound on FreeBSD for my home NAS (see FreeBSD: Home NAS, part 4 – a local DNS with Unbound), the config is simple and clear, and on top of that it kicks systemd-resolved with all its complexities out of the picture – a solid option.
Although dnsmasq might’ve been a better solution for a laptop – because the config is even simpler, but I really liked Unbound – so I went with it.
Install the package itself:
$ sudo pacman -S unbound
What we need to do:
resolve compute.internal (AWS EC2 etc) through OpenVPN and AWS VPC DNS
the same for ops.example.com, because that’s where we have records for AWS RDS like db.prod.ops.example.com
resolve grafana.net.setevoy through MikroTik, because that’s my local zone for home hosts
We write the /etc/unbound/unbound.conf file, describing three forward-zone blocks with our own DNS and one with public DNS:
server:
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow
    do-ip6: no
    hide-identity: yes
    hide-version: yes
    prefetch: yes

# local homelab via MikroTik
forward-zone:
    name: "setevoy."
    forward-addr: 10.100.0.1
    forward-addr: 192.168.0.1

forward-zone:
    name: "compute.internal."
    forward-addr: 10.0.0.2

forward-zone:
    name: "ops.example.com."
    forward-addr: 10.0.0.2

# everything else
forward-zone:
    name: "."
    forward-addr: 1.1.1.1
    forward-addr: 8.8.8.8
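Unbound sends each query to the most specific forward-zone that the name falls under, and everything else lands in the "." zone. The sketch below imitates that selection for the zones from this config, just to make the routing of names visible. It is not Unbound’s own code: pick_zone is a made-up helper, and host.compute.internal and eks.example.amazonaws.com are placeholder names.

```shell
#!/bin/sh
# Zones from the unbound.conf above (the catch-all "." is the fallback).
ZONES="setevoy. compute.internal. ops.example.com."

# Print the forward-zone a name falls under: longest matching suffix wins.
pick_zone() (
    name=${1%.}.    # normalise to a trailing dot
    best="."
    for zone in $ZONES; do
        case "$name" in
            "$zone"|*".$zone")
                # match on a label boundary only, and prefer the longer zone
                if [ ${#zone} -gt ${#best} ]; then
                    best=$zone
                fi
                ;;
        esac
    done
    echo "$best"
)

for n in grafana.net.setevoy host.compute.internal \
         db.prod.ops.example.com eks.example.amazonaws.com; do
    printf '%s -> %s\n' "$n" "$(pick_zone "$n")"
done
```

The last name matches none of the private zones, which is why the EKS endpoint ends up at the public resolvers.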
Check the syntax:
$ sudo unbound-checkconf
unbound-checkconf: no errors in /etc/unbound/unbound.conf
In the post Arch Linux: WireGuard Peer for connecting to MikroTik I described a solution to a different problem, and there I added dns=systemd-resolved for NetworkManager.
If it’s there – remove it in /etc/NetworkManager/NetworkManager.conf, just set dns=none:
...
[main]
dns=none
Disable systemd-resolved (the internet will drop here – because there’s nowhere to send DNS):
$ sudo systemctl disable --now systemd-resolved systemd-resolved-monitor.socket systemd-resolved-varlink.socket
Restart NetworkManager:
$ sudo systemctl restart NetworkManager
Check port 53 – if systemd-resolved is still alive, that means something is triggering its startup:
$ sudo ss -tulpn | grep ':53'
...
tcp LISTEN 0 4096 127.0.0.53%lo:53 0.0.0.0:* users:(("systemd-resolve",pid=723720,fd=25))
tcp LISTEN 0 4096 127.0.0.54:53 0.0.0.0:* users:(("systemd-resolve",pid=723720,fd=27))
You can hard-block its startup with systemctl mask:
$ sudo systemctl mask systemd-resolved
$ sudo systemctl stop systemd-resolved
Check the ports once more, and if there’s no longer anyone on port 53 – start unbound.service:
$ sudo systemctl stop systemd-resolved
$ sudo systemctl enable --now unbound
Created symlink '/etc/systemd/system/multi-user.target.wants/unbound.service' → '/usr/lib/systemd/system/unbound.service'.
$ sudo ss -tulpn | grep ':53\b'
udp UNCONN 0 0 127.0.0.1:53 0.0.0.0:* users:(("unbound",pid=727532,fd=3))
tcp LISTEN 0 256 127.0.0.1:53 0.0.0.0:* users:(("unbound",pid=727532,fd=4))
Edit /etc/resolv.conf – point all DNS through it:
nameserver 127.0.0.1
And try something public:
$ dig google.com +short
216.58.207.14
Then the EKS endpoint – it should return public IPs:
$ dig F07***D78.gr7.us-east-1.eks.amazonaws.com +short
3.***.***.161
44.***.***.46
Try RDS – it should return private IPs from the VPC pool:
$ dig prod.db.kraken.ops.example.com +short
kraken-ops-rds-prod.***.us-east-1.rds.amazonaws.com.
10.0.66.14
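These spot checks can be wrapped into a small script that fails loudly when a name resolves to the wrong kind of address. A minimal sketch, assuming dig is installed and Unbound listens on 127.0.0.1 as configured above; is_private_ip and expect_kind are made-up helper names, and the hostnames in the usage lines are placeholders.

```shell
#!/bin/sh
# Succeed if the IPv4 address is in an RFC 1918 private range.
is_private_ip() {
    case "$1" in
        10.*|192.168.*) return 0 ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
        *) return 1 ;;
    esac
}

# Resolve NAME through the local Unbound and check that every A record is of
# the expected KIND ("private" or "public").
expect_kind() (
    name=$1
    kind=$2
    # keep only IPv4 lines; +short also prints CNAME targets, which are skipped
    ips=$(dig +short @127.0.0.1 "$name" A | grep -E '^[0-9]+(\.[0-9]+){3}$' || true)
    if [ -z "$ips" ]; then
        echo "FAIL $name: no A records"
        exit 1
    fi
    for ip in $ips; do
        if is_private_ip "$ip"; then got=private; else got=public; fi
        if [ "$got" != "$kind" ]; then
            echo "FAIL $name: $ip is $got, expected $kind"
            exit 1
        fi
    done
    echo "OK   $name: $kind"
)

# Example (placeholder names):
# expect_kind EXAMPLE.gr7.us-east-1.eks.amazonaws.com public
# expect_kind db.prod.ops.example.com private
```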
Edit the WireGuard /etc/wireguard/wg0.conf – change the DNS parameter:
[Interface]
...
DNS = 127.0.0.1
...
Run sudo resolvconf -u, since we made changes to /etc/resolv.conf manually and WireGuard will complain.
Restart WireGuard:
$ sudo wg-quick down wg0 && sudo wg-quick up wg0
Check the file:
$ cat /etc/resolv.conf
# Generated by resolvconf
nameserver 127.0.0.1
And now everything works as it should.