We had a legacy CI/CD pipeline whose build turnaround time equalled the time the Earth takes to revolve on its own axis. To ease some of the pain developers faced daily because of this legacy pipeline, we started rebuilding the build infrastructure to leverage certain best practices and moved away from the monorepo infra into a multi-repo structure. While doing this, we also moved to a Docker registry that was part of the IT-managed infrastructure, so that we didn’t have to spend time and effort managing the registry on our own.
Like any good software engineer would, we did this close to a weekend. I mean, why not? The rest of the world follows the same principle for production releases 😆
Monday morning, we were welcomed by a flurry of issues from the QA teams indicating that the upgrade workflow was failing constantly, and all signs of failure pointed to the docker pull of the images required to perform the upgrade constantly timing out.
Background
Before I go any further, I would like to set the context straight and give you some background details so that you can follow this blog along with me easily.
- I am part of a platform development team that is responsible for building an on-premise Kubernetes cluster.
- We provide in-place upgrades of everything we develop/provide/extend (i.e. Linux kernel, Kubernetes control plane, etc.).
- We run a Docker registry inside the cluster to enable use cases such as air-gap, where the cluster can’t really reach out to an external network to pull artifacts.
- We were using a locally managed upstream Docker registry for all the CI/CD use cases, and the test pipelines were leveraging the same infrastructure.
- We run a DNS service on the cluster itself that leverages a link-local IP to provide node-local DNS behavior. So you will see most of the DNS resolved against 169.254.20.10 in this blog post.
The Symptoms
On most of the clusters, the docker pull of the images required for the upgrade was timing out 8 out of 10 times when pulling from the new Docker registry that we started using with the migrated CI/CD infrastructure.
[Fri Dec 06 05:58:11 UTC] tardis@10.106.176.205 (tardis-master-1) ~
$ docker pull dockerhub.company.com/tardis-docker/some-image:1.5.38
Error response from daemon: Get https://dockerhub.company.com/v2/: dial tcp: lookup dockerhub.company.com on 169.254.20.10:53: read udp 169.254.20.10:49039->169.254.20.10:53: i/o timeout
Early Diagnosis
As any good software engineer would do, I didn’t trust the reports and errors shared by someone else. I managed to get my hands on the cluster and did a quick check of a few things to ensure there weren’t any fundamental issues in the setup itself that were leading to this behavior.
- Check if the upstream Docker registry was actually reachable. ping/dig were good enough tools for this check.
- Check if the DNS service on the host was misconfigured in a way that broke name resolution for the upstream Docker registry.
# Ping check to ensure basic reachability
[Fri Dec 06 05:27:04 UTC] tardis@10.106.176.205 (tardis-master-1) ~
$ ping dockerhub.company.com -c 3
PING dockerhub.company.com (1.2.3.4) 56(84) bytes of data.
64 bytes from bgl-wwpl-prxy2.company.com (1.2.3.4): icmp_seq=1 ttl=56 time=4.42 ms
64 bytes from bgl-wwpl-prxy2.company.com (1.2.3.4): icmp_seq=2 ttl=56 time=3.20 ms
64 bytes from bgl-wwpl-prxy2.company.com (1.2.3.4): icmp_seq=3 ttl=56 time=2.56 ms
--- dockerhub.company.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 2.564/3.398/4.422/0.770 ms
Validate the CoreDNS configuration to make sure it doesn’t have any invalid settings
# CoreDNS ConfigMap for the corefile
.:53 {
errors
health 169.254.20.10:8080
kubernetes cluster.local 10.24.0.0/16 10.23.0.0/16 {
pods insecure
upstream
}
forward . 7.8.8.7
cache 30
loop
reload
loadbalance
}
To ensure the validity of some of these, I ran a few similar tests on my MacBook as well, and the docker pull for the same some-image:1.5.38 was working properly, albeit a bit slower than usual. I didn’t see the point in spending too much time on it, so I handed the cluster back to the QA team, suggesting they retry to see if the issue could be reproduced before we started looking into the need for a possible fix and what might be causing the issue, if there was one at all.
It did not take long for the QA team to come back saying that the issue was surfacing again, and this time there was an official bug opened.
Operation docker pull
Notes
- Enable the CoreDNS logging plugin to get DNS resolution logs. They are your saving grace.
- Understand the CoreDNS log format. Understanding this format will help you sift through the logs easily and find what you are looking for faster.
# CoreDNS Log formatting
{remote}:{port} - {>id} "{type} {class} {name} {proto} {size} {>do} {>bufsize}" {rcode} {>rflags} {rsize} {duration}
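Reading these log lines by eye gets old fast. Here is a minimal sketch (a helper of my own, not part of CoreDNS) that splits one query log line into named fields according to the format above:

```python
import re

# Matches one CoreDNS query log line per the format above, e.g.:
# 2019-12-06T05:29:20.097Z [INFO] 169.254.20.10:50269 - 23709
#   "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,tc,rd,ra 839 0.004455923s
LOG_RE = re.compile(
    r'(?P<ts>\S+) \[INFO\] (?P<remote>[\d.]+):(?P<port>\d+) - (?P<id>\d+) '
    r'"(?P<type>\S+) (?P<class>\S+) (?P<name>\S+) (?P<proto>\S+) '
    r'(?P<size>\d+) (?P<do>\S+) (?P<bufsize>\d+)" '
    r'(?P<rcode>\S+) (?P<rflags>\S+) (?P<rsize>\d+) (?P<duration>\S+)'
)

def parse_coredns_log(line):
    """Return the fields of one CoreDNS query log line as a dict, or None."""
    m = LOG_RE.match(line.strip())
    return m.groupdict() if m else None
```

With fields like `rflags`, `rsize`, and `bufsize` pulled out, spotting truncated responses (the tc flag) across thousands of lines becomes a one-liner.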
Check #1
Since the docker pull was consistently timing out, I wanted to check whether the behavior remained the same for other operations. So I went ahead and tried a few curl commands against the upstream registry to fetch the available tags. While doing this, I collected logs on the CoreDNS pod to keep an eye on the state of affairs on the other side of the world.
[Fri Dec 06 05:25:54 UTC] tardis@10.106.176.205 (tardis-master-1) ~
$ curl https://dockerhub.company.com/v2/tardis-docker/tardis-catalog/tags/list
{
"name" : "tardis-docker/tardis-catalog",
"tags" : [ "1.5.10", "1.5.11", "1.5.12", "1.5.13", "1.5.14", "1.5.15", "1.5.16", "1.5.17", "1.5.18", "1.5.19", "1.5.2", "1.5.20", "1.5.21", "1.5.3", "1.5.4", "1.5.5", "1.5.6", "1.5.7", "1.5.8", "1.5.9" ],
"errors" : null
}
CoreDNS Logs
2019-12-06T05:29:20.097Z [INFO] 169.254.20.10:50269 - 23709 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,tc,rd,ra 839 0.004455923s
2019-12-06T05:29:20.099Z [INFO] 169.254.20.10:50269 - 25849 "AAAA IN dockerhub.company.com. udp 37 false 512" NOERROR qr,rd,ra 37 0.005162838s # ignore this. Not required for IPV4
2019-12-06T05:29:20.105Z [INFO] 169.254.20.10:43158 - 23709 "A IN dockerhub.company.com. tcp 37 false 65535" NOERROR qr,rd,ra 1667 0.007212426s
2019-12-06T05:29:20.105Z [INFO] 169.254.20.10:43158 - 25849 "AAAA IN dockerhub.company.com. tcp 37 false 65535" NOERROR qr,aa,rd,ra 37 0.000104123s # ignore this. Not required for IPV4
Request #1: qr,tc,rd,ra — these flags indicate that the response was truncated (tc flag) because the receive buffer size (512 bytes) was smaller than the response (839 bytes). The client picks up this detail and, since the tc flag was set, performs a tcp-based request instead of udp, with a receive buffer size of 65535 bytes. In this case the actual response size is 1667 bytes and hence the tc flag won’t be set, which means the client will not have to worry about truncated responses.
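The truncate-then-retry dance is visible in the raw DNS message too. As a sketch (pure stdlib; the function names are mine), the tc bit lives in the flags word of the 12-byte DNS header, and a well-behaved stub resolver falls back to tcp exactly when that bit is set:

```python
import struct

TC_BIT = 0x0200  # truncation bit in the DNS header flags word (RFC 1035 §4.1.1)

def is_truncated(dns_message: bytes) -> bool:
    """True if the DNS response sets the TC flag, i.e. the answer did not
    fit in the receive buffer size the client advertised."""
    # The flags word is the second 16-bit field of the header (offset 2).
    (flags,) = struct.unpack_from("!H", dns_message, 2)
    return bool(flags & TC_BIT)

def should_retry_over_tcp(dns_message: bytes) -> bool:
    # A stub resolver retries over tcp on truncation, as curl's
    # resolver did in the logs above.
    return is_truncated(dns_message)
```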
Once the client (curl in this case) manages to get its hands on the actual IP for the name, it establishes a connection to that IP, at port 443 in this case.
root@tardis-master-1:/home/tardis# conntrack -E -p tcp --dport 443 --orig-dst 1.2.3.4
[NEW] tcp 6 120 SYN_SENT src=10.106.176.205 dst=1.2.3.4 sport=55174 dport=443 [UNREPLIED] src=1.2.3.4 dst=10.106.176.205 sport=443 dport=55174
[UPDATE] tcp 6 60 SYN_RECV src=10.106.176.205 dst=1.2.3.4 sport=55174 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=55174
[UPDATE] tcp 6 86400 ESTABLISHED src=10.106.176.205 dst=1.2.3.4 sport=55174 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=55174 [ASSURED]
[UPDATE] tcp 6 120 FIN_WAIT src=10.106.176.205 dst=1.2.3.4 sport=55174 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=55174 [ASSURED]
[UPDATE] tcp 6 10 CLOSE src=10.106.176.205 dst=1.2.3.4 sport=55174 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=55174 [ASSURED]
[DESTROY] tcp 6 src=10.106.176.205 dst=1.2.3.4 sport=55174 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=55174 [ASSURED]
Now this is a good enough detail for us to confirm that the reachability and accessibility of dockerhub.company.com is not the issue causing the docker pull to fail.
Check #2
Considering Check #1 did not fail, it was time to start looking into why the docker pull command was actually having issues. The only difference this time was that, instead of curling the API, I would be doing a docker pull of one of the available images in the registry.
[Fri Dec 06 05:58:11 UTC] tardis@10.106.176.205 (tardis-master-1) ~
$ docker pull dockerhub.company.com/tardis-docker/some-image:1.5.38
Error response from daemon: Get https://dockerhub.company.com/v2/: dial tcp: lookup dockerhub.company.com on 169.254.20.10:53: read udp 169.254.20.10:49039->169.254.20.10:53: i/o timeout
Let us verify the conntrack entries to see what was happening during the docker pull.
root@tardis-master-1:/home/tardis# conntrack -E -p tcp --dport 443 --orig-dst 1.2.3.4
CoreDNS logs during the docker pull operations:
2019-12-06T06:05:17.457Z [INFO] 169.254.20.10:44526 - 16575 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,tc,rd,ra 771 0.028012678s
2019-12-06T06:05:17.457Z [INFO] 169.254.20.10:56598 - 26043 "AAAA IN dockerhub.company.com. udp 37 false 512" NOERROR qr,rd,ra 37 0.028323846s
2019-12-06T06:05:17.525Z [INFO] 169.254.20.10:43196 - 56518 "A IN dockerhub.company.com. tcp 37 false 65535" NOERROR qr,rd,ra 1667 0.067541454s
2019-12-06T06:05:17.527Z [INFO] 169.254.20.10:35321 - 46276 "AAAA IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,rd,ra 37 0.000340633s
2019-12-06T06:05:17.527Z [INFO] 169.254.20.10:49038 - 11125 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,tc,rd,ra 1667 0.000509809s
2019-12-06T06:05:22.527Z [INFO] 169.254.20.10:42834 - 51654 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,tc,rd,ra 1667 0.000394149s
2019-12-06T06:05:27.579Z [INFO] 169.254.20.10:44374 - 4021 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,tc,rd,ra 1667 0.000393464s
2019-12-06T06:05:27.629Z [INFO] 169.254.20.10:44237 - 31303 "AAAA IN dockerhub.company.com. udp 37 false 512" NOERROR qr,rd,ra 37 0.050828536s
2019-12-06T06:05:32.579Z [INFO] 169.254.20.10:56485 - 55782 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,tc,rd,ra 1667 0.000484152s
2019-12-06T06:05:37.603Z [INFO] 169.254.20.10:55158 - 51922 "AAAA IN dockerhub.company.com. udp 37 false 512" NOERROR qr,rd,ra 37 0.021003385s
2019-12-06T06:05:37.604Z [INFO] 169.254.20.10:45951 - 38036 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,tc,rd,ra 838 0.021930555s
2019-12-06T06:05:37.646Z [INFO] 169.254.20.10:43448 - 15671 "A IN dockerhub.company.com. tcp 37 false 65535" NOERROR qr,rd,ra 1667 0.041151161s
2019-12-06T06:05:37.647Z [INFO] 169.254.20.10:59299 - 31793 "AAAA IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,rd,ra 37 0.000206121s
2019-12-06T06:05:37.647Z [INFO] 169.254.20.10:60743 - 58335 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,tc,rd,ra 1667 0.000393036s
2019-12-06T06:05:42.648Z [INFO] 169.254.20.10:59304 - 2860 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,tc,rd,ra 1667 0.000374644s
So that is one too many queries being sent to the DNS server. What does tcpdump say?
root@tardis-master-1:/home/tardis# tcpdump udp and port 53 -i lo -n -t | grep -a1 dockerhub.company.com
IP 169.254.20.10.57175 > 169.254.20.10.53: 21489+ A? dockerhub.company.com.cluster.local. (51)
IP 169.254.20.10.57315 > 169.254.20.10.53: 2120+ AAAA? dockerhub.company.com.cluster.local. (51)
IP 169.254.20.10.53 > 169.254.20.10.57315: 2120 NXDomain*- 0/1/0 (144)
IP 169.254.20.10.53 > 169.254.20.10.57175: 21489 NXDomain*- 0/1/0 (144)
IP 169.254.20.10.35290 > 169.254.20.10.53: 2586+ A? dockerhub.company.com.local. (43)
IP 169.254.20.10.46015 > 169.254.20.10.53: 24921+ AAAA? dockerhub.company.com.local. (43)
IP 169.254.20.10.53 > 169.254.20.10.46015: 24921 NXDomain* 0/1/0 (118)
IP 169.254.20.10.53 > 169.254.20.10.35290: 2586 NXDomain* 0/1/0 (118)
IP 169.254.20.10.45951 > 169.254.20.10.53: 38036+ A? dockerhub.company.com. (37)
IP 169.254.20.10.55158 > 169.254.20.10.53: 51922+ AAAA? dockerhub.company.com. (37)
IP 169.254.20.10.53 > 169.254.20.10.55158: 51922 0/0/0 (37)
IP 169.254.20.10.53 > 169.254.20.10.45951: 38036| 1/11/0 A 1.2.3.4 (511)
IP 169.254.20.10.59299 > 169.254.20.10.53: 31793+ AAAA? dockerhub.company.com. (37)
IP 169.254.20.10.60743 > 169.254.20.10.53: 58335+ A? dockerhub.company.com. (37)
IP 169.254.20.10.53 > 169.254.20.10.59299: 31793* 0/0/0 (37)
IP 169.254.20.10.53 > 169.254.20.10.60743: 58335*| 0/13/0 (581)
IP 169.254.20.10.59304 > 169.254.20.10.53: 2860+ A? dockerhub.company.com. (37)
IP 169.254.20.10.53 > 169.254.20.10.59304: 2860*| 0/13/0 (581)
IP 169.254.20.10.52327 > 169.254.20.10.53: 39346+ A? dockerhub.company.com.default.svc.cluster.local. (63)
IP 169.254.20.10.43270 > 169.254.20.10.53: 2452+ AAAA? dockerhub.company.com.default.svc.cluster.local. (63)
IP 169.254.20.10.53 > 169.254.20.10.43270: 2452 NXDomain*- 0/1/0 (156)
IP 169.254.20.10.53 > 169.254.20.10.52327: 39346 NXDomain*- 0/1/0 (156)
IP 169.254.20.10.60913 > 169.254.20.10.53: 39780+ A? dockerhub.company.com.tardis-system.svc.cluster.local. (69)
IP 169.254.20.10.35227 > 169.254.20.10.53: 8325+ AAAA? dockerhub.company.com.tardis-system.svc.cluster.local. (69)
IP 169.254.20.10.53 > 169.254.20.10.35227: 8325 NXDomain*- 0/1/0 (162)
IP 169.254.20.10.53 > 169.254.20.10.60913: 39780 NXDomain*- 0/1/0 (162)
IP 169.254.20.10.36880 > 169.254.20.10.53: 14892+ A? dockerhub.company.com.svc.cluster.local. (55)
IP 169.254.20.10.59149 > 169.254.20.10.53: 17416+ AAAA? dockerhub.company.com.svc.cluster.local. (55)
IP 169.254.20.10.53 > 169.254.20.10.59149: 17416 NXDomain*- 0/1/0 (148)
IP 169.254.20.10.53 > 169.254.20.10.36880: 14892 NXDomain*- 0/1/0 (148)
IP 169.254.20.10.44019 > 169.254.20.10.53: 21987+ A? dockerhub.company.com.cluster.local. (51)
IP 169.254.20.10.46408 > 169.254.20.10.53: 204+ AAAA? dockerhub.company.com.cluster.local. (51)
IP 169.254.20.10.53 > 169.254.20.10.44019: 21987 NXDomain*- 0/1/0 (144)
IP 169.254.20.10.53 > 169.254.20.10.46408: 204 NXDomain*- 0/1/0 (144)
IP 169.254.20.10.50775 > 169.254.20.10.53: 4835+ A? dockerhub.company.com.local. (43)
IP 169.254.20.10.42858 > 169.254.20.10.53: 63875+ AAAA? dockerhub.company.com.local. (43)
IP 169.254.20.10.53 > 169.254.20.10.42858: 63875 NXDomain* 0/1/0 (118)
root@tardis-master-1:/home/tardis# tcpdump tcp and port 53 -i lo -n -t
--
IP 169.254.20.10.45560 > 169.254.20.10.53: Flags [.], ack 1, win 342, options [nop,nop,TS val 224381706 ecr 224381706], length 0
IP 169.254.20.10.45560 > 169.254.20.10.53: Flags [P.], seq 1:40, ack 1, win 342, options [nop,nop,TS val 224381706 ecr 224381706], length 3912988+ A? dockerhub.company.com. (37)
IP 169.254.20.10.53 > 169.254.20.10.45560: Flags [.], ack 40, win 342, options [nop,nop,TS val 224381706 ecr 224381706], length 0
--
IP 169.254.20.10.45622 > 169.254.20.10.53: Flags [.], ack 1, win 342, options [nop,nop,TS val 224386767 ecr 224386767], length 0
IP 169.254.20.10.45622 > 169.254.20.10.53: Flags [P.], seq 1:40, ack 1, win 342, options [nop,nop,TS val 224386767 ecr 224386767], length 3964183+ A? dockerhub.company.com. (37)
IP 169.254.20.10.53 > 169.254.20.10.45622: Flags [.], ack 40, win 342, options [nop,nop,TS val 224386767 ecr 224386767], length 0
--
IP 169.254.20.10.45872 > 169.254.20.10.53: Flags [.], ack 1, win 342, options [nop,nop,TS val 224406894 ecr 224406894], length 0
IP 169.254.20.10.45872 > 169.254.20.10.53: Flags [P.], seq 1:40, ack 1, win 342, options [nop,nop,TS val 224406895 ecr 224406894], length 3927093+ A? dockerhub.company.com. (37)
Okay. So we can see that there are multiple tcp-based requests being made to perform an A record query against the DNS server, as we noticed in the CoreDNS logs. So what might be causing this? Turns out this is normal behavior for dockerd.
// File: vendor/github.com/docker/libnetwork/resolver.go
resp, err = co.ReadMsg()
// Truncated DNS replies should be sent to the client so that the
// client can retry over TCP
if err != nil && err != dns.ErrTruncated {
    r.forwardQueryEnd()
    logrus.Debugf("[resolver] read from DNS server failed, %s", err)
    continue
}
r.forwardQueryEnd()
So this causes the sequence of queries for the A and AAAA records to be triggered against the DNS server, and the request eventually times out before it can obtain the final IP address needed to establish the actual connection, which is why we see the following error.
169.254.20.10:49039->169.254.20.10:53: i/o timeout
This indicates that the request to the DNS server timed out after a certain interval.
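The flood of queries in the tcpdump output comes from the resolver walking the search domains for a name that doesn't have "enough" dots. A simplified sketch of that expansion (the function name is mine, the ordering is simplified, and ndots=5 is the common Kubernetes default — an assumption, since the resolv.conf shown here doesn't set it explicitly):

```python
def expand_with_search(name, search_domains, ndots=5):
    """Candidate FQDNs a stub resolver will try, in order.

    A trailing dot makes the name absolute and skips the search list.
    If the name has fewer than `ndots` dots, each search domain is
    appended and tried before the name itself; otherwise the name is
    tried as-is first. Each candidate typically costs an A and an
    AAAA query, which is why the tcpdump above is so chatty.
    """
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{d}" for d in search_domains]
    if name.count(".") < ndots:
        return expanded + [name + "."]
    return [name + "."] + expanded
```

With the five search domains from this cluster's resolv.conf, a single lookup of dockerhub.company.com can fan out into a dozen round trips before the resolver ever gets a usable answer.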
The Remedy
Turns out resolv.conf already has a fix for this, and it is a simple one called edns0, as per RFC 2671. It changes the default behavior used while performing queries against the DNS server for the A and AAAA records, and it turns out to be super easy to configure.
options edns0
Add the above at the bottom of the /etc/resolv.conf file, and that is all you need to do. You also need to restart the Docker daemon so that this new configuration is honored by Docker. (Turns out Docker was supporting this behavior all along through its libnetwork utility.)
root@tardis-master-1:/home/tardis# cat /etc/resolvconf/resolv.conf.d/base
#---------------------------------------------------------------------
# Modified by tardis: Tue, 03 Sep 2019 09:03:14 UTC
# tardis-config 1716
#---------------------------------------------------------------------
search default.svc.cluster.local tardis-system.svc.cluster.local svc.cluster.local cluster.local local
options edns0
root@tardis-master-1:/home/tardis# cat /etc/resolv.conf
nameserver 169.254.20.10
search default.svc.cluster.local tardis-system.svc.cluster.local svc.cluster.local cluster.local local
options edns0
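What edns0 actually does on the wire: the client appends an OPT pseudo-record (RFC 2671) to the additional section of its query, and the CLASS field of that record advertises a UDP payload size larger than the classic 512-byte limit. A minimal sketch of that 11-byte record (helper names mine; 1024 matches the bufsize that shows up in the CoreDNS logs after the change):

```python
import struct

def edns0_opt_record(udp_payload_size=1024):
    """Build the wire-format OPT pseudo-record (RFC 2671) that a client
    appends to a query's additional section when edns0 is enabled.

    Layout: root name (1 zero byte), TYPE=41 (OPT), CLASS=advertised UDP
    payload size, TTL=0 (extended rcode/flags), RDLENGTH=0.
    """
    return struct.pack("!BHHIH", 0, 41, udp_payload_size, 0, 0)

def advertised_size(opt_record):
    """Read the payload size back out of the CLASS field (offset 3)."""
    return struct.unpack_from("!H", opt_record, 3)[0]
```

Because the server now knows the client can accept a bigger UDP response, the 1667-byte answer fits in one datagram and the truncate/retry-over-tcp loop disappears.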
Check 1
Let’s pull the images again and see how they perform now.
[Fri Dec 06 06:26:09 UTC] tardis@10.106.176.205 (tardis-master-1) ~
$ docker pull dockerhub.company.com/tardis-docker/some-image:1.5.38
1.5.38: Pulling from tardis-docker/some-image
f2233041f557: Already exists
f321bcc6a76c: Already exists
2f25d8d1d058: Already exists
87bfe0d2f0e8: Already exists
145c1bf7947a: Already exists
ea0a6f484557: Already exists
4d03226aa6f1: Already exists
79cf72173686: Already exists
31864de8b88a: Already exists
Good, now that docker has started pulling the image, let’s quickly check the status of the tcpdump, conntrack and coredns logs to see how they behave differently now.
2019-12-06T06:28:04.119Z [INFO] 169.254.20.10:49622 - 9335 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 37 0.000263135s
2019-12-06T06:28:04.119Z [INFO] 169.254.20.10:49622 - 43254 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000332536s
2019-12-06T06:28:04.272Z [INFO] 169.254.20.10:40189 - 15099 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 37 0.000130516s
2019-12-06T06:28:04.272Z [INFO] 169.254.20.10:40189 - 4122 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000219738s
2019-12-06T06:28:27.429Z [INFO] 169.254.20.10:58124 - 12003 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,rd,ra 1678 0.025104794s
2019-12-06T06:28:27.431Z [INFO] 169.254.20.10:58124 - 55950 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,rd,ra 48 0.02649031s
2019-12-06T06:28:28.812Z [INFO] 169.254.20.10:48714 - 46081 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 37 0.000123496s
2019-12-06T06:28:28.812Z [INFO] 169.254.20.10:48714 - 45597 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000331035s
2019-12-06T06:28:29.201Z [INFO] 169.254.20.10:58678 - 27283 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 37 0.000140919s
2019-12-06T06:28:29.201Z [INFO] 169.254.20.10:58678 - 1256 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000179996s
2019-12-06T06:28:30.101Z [INFO] 169.254.20.10:50055 - 54440 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 37 0.000185493s
2019-12-06T06:28:30.101Z [INFO] 169.254.20.10:50055 - 39023 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000267243s
2019-12-06T06:28:31.120Z [INFO] 169.254.20.10:46638 - 29248 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 37 0.000139373s
2019-12-06T06:28:31.120Z [INFO] 169.254.20.10:46638 - 30294 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000198622s
2019-12-06T06:28:32.007Z [INFO] 169.254.20.10:57405 - 41092 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 37 0.000109586s
2019-12-06T06:28:32.007Z [INFO] 169.254.20.10:57405 - 21216 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000135747s
If you notice above, there are no more tcp requests being made for A or AAAA records. So what changed? Look at the log line below:
2019-12-06T06:28:04.119Z [INFO] 169.254.20.10:49622 - 43254 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000332536s
You notice that the receive buffer is now 1024 bytes instead of the previous 512 bytes, and the qr,aa,rd,ra flags indicate that the tc flag is no longer set. So no more truncation. We also notice that there is an AuthoritativeAnswer (aa) flag set, indicating that we have successfully managed to resolve the name to an IP address, and dockerd can now go ahead, set up the required connection, and start pulling things.
[UPDATE] tcp 6 120 FIN_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44208 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44208 [ASSURED]
[UPDATE] tcp 6 3600 CLOSE_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44208 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44208 [ASSURED]
[UPDATE] tcp 6 30 LAST_ACK src=10.106.176.205 dst=1.2.3.4 sport=44208 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44208 [ASSURED]
[UPDATE] tcp 6 10 CLOSE src=10.106.176.205 dst=1.2.3.4 sport=44208 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44208 [ASSURED]
[NEW] tcp 6 120 SYN_SENT src=10.106.176.205 dst=1.2.3.4 sport=44954 dport=443 [UNREPLIED] src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44954
[UPDATE] tcp 6 60 SYN_RECV src=10.106.176.205 dst=1.2.3.4 sport=44954 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44954
[UPDATE] tcp 6 86400 ESTABLISHED src=10.106.176.205 dst=1.2.3.4 sport=44954 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44954 [ASSURED]
[UPDATE] tcp 6 120 FIN_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44678 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44678 [ASSURED]
[UPDATE] tcp 6 300 CLOSE_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44678 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44678 [ASSURED]
[UPDATE] tcp 6 30 LAST_ACK src=10.106.176.205 dst=1.2.3.4 sport=44678 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44678 [ASSURED]
[NEW] tcp 6 120 SYN_SENT src=10.106.176.205 dst=1.2.3.4 sport=44974 dport=443 [UNREPLIED] src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44974
[UPDATE] tcp 6 10 CLOSE src=10.106.176.205 dst=1.2.3.4 sport=44678 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44678 [ASSURED]
[UPDATE] tcp 6 60 SYN_RECV src=10.106.176.205 dst=1.2.3.4 sport=44974 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44974
[UPDATE] tcp 6 86400 ESTABLISHED src=10.106.176.205 dst=1.2.3.4 sport=44974 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44974 [ASSURED]
[UPDATE] tcp 6 120 FIN_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44954 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44954 [ASSURED]
[UPDATE] tcp 6 300 CLOSE_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44954 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44954 [ASSURED]
[UPDATE] tcp 6 30 LAST_ACK src=10.106.176.205 dst=1.2.3.4 sport=44954 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44954 [ASSURED]
[NEW] tcp 6 120 SYN_SENT src=10.106.176.205 dst=1.2.3.4 sport=44978 dport=443 [UNREPLIED] src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44978
[UPDATE] tcp 6 10 CLOSE src=10.106.176.205 dst=1.2.3.4 sport=44954 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44954 [ASSURED]
[UPDATE] tcp 6 60 SYN_RECV src=10.106.176.205 dst=1.2.3.4 sport=44978 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44978
[UPDATE] tcp 6 86400 ESTABLISHED src=10.106.176.205 dst=1.2.3.4 sport=44978 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44978 [ASSURED]
[UPDATE] tcp 6 120 FIN_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44978 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44978 [ASSURED]
[UPDATE] tcp 6 300 CLOSE_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44978 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44978 [ASSURED]
For each layer being pulled, we now see a sequence of conntrack entries being created.
root@tardis-master-1:/home/tardis# tcpdump tcp and port 53 -i lo -n -t | grep -a1 dockerhub.company.com
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
No more tcp requests are being made to the DNS server to resolve dockerhub.company.com.
--
IP 169.254.20.10.53 > 169.254.20.10.52960: 47243*- 0/1/0 (151)
IP 169.254.20.10.58124 > 169.254.20.10.53: 12003+ [1au] A? dockerhub.company.com. (48)
IP 169.254.20.10.58124 > 169.254.20.10.53: 55950+ [1au] AAAA? dockerhub.company.com. (48)
IP 169.254.20.10.53 > 169.254.20.10.58124: 12003 1/13/14 A 1.2.3.4 (816)
IP 169.254.20.10.53 > 169.254.20.10.58124: 55950 0/0/1 (48)
IP 169.254.20.10.48714 > 169.254.20.10.53: 45597+ [1au] A? dockerhub.company.com. (48)
IP 169.254.20.10.48714 > 169.254.20.10.53: 46081+ [1au] AAAA? dockerhub.company.com. (48)
IP 169.254.20.10.53 > 169.254.20.10.48714: 46081* 0/0/1 (48)
IP 169.254.20.10.53 > 169.254.20.10.48714: 45597* 1/13/14 A 1.2.3.4 (816)
IP 169.254.20.10.58678 > 169.254.20.10.53: 1256+ [1au] A? dockerhub.company.com. (48)
IP 169.254.20.10.58678 > 169.254.20.10.53: 27283+ [1au] AAAA? dockerhub.company.com. (48)
IP 169.254.20.10.53 > 169.254.20.10.58678: 27283* 0/0/1 (48)
IP 169.254.20.10.53 > 169.254.20.10.58678: 1256* 1/13/14 A 1.2.3.4 (816)
IP 169.254.20.10.50055 > 169.254.20.10.53: 39023+ [1au] A? dockerhub.company.com. (48)
IP 169.254.20.10.50055 > 169.254.20.10.53: 54440+ [1au] AAAA? dockerhub.company.com. (48)
IP 169.254.20.10.53 > 169.254.20.10.50055: 54440* 0/0/1 (48)
IP 169.254.20.10.53 > 169.254.20.10.50055: 39023* 1/13/14 A 1.2.3.4 (816)
IP 169.254.20.10.46638 > 169.254.20.10.53: 30294+ [1au] A? dockerhub.company.com. (48)
IP 169.254.20.10.46638 > 169.254.20.10.53: 29248+ [1au] AAAA? dockerhub.company.com. (48)
IP 169.254.20.10.53 > 169.254.20.10.46638: 29248* 0/0/1 (48)
And no more udp queries to the DNS server for A or AAAA records of the name expanded with the search domains, since we already have an authoritative answer.