Curious Case of Docker Pull Timeout

We had a legacy CI/CD pipeline whose build turnaround time equalled the time it takes the Earth to rotate once on its own axis. To ease some of the pain developers faced daily because of this legacy pipeline, we started rebuilding the build infra around better practices and moved away from the monorepo setup into a multi-repo structure. While doing this, we also moved to a docker registry that was part of the IT-managed infrastructure so that we didn’t have to spend time and effort managing the registry on our own.

Like any good software engineer would do, we did this close to a weekend. I mean, why not? The rest of the world follows the same principle for production releases 😆

Monday morning, we were welcomed by a flurry of issues from the QA teams indicating that the upgrade workflow was failing constantly, and all signs pointed to the docker pulls of the images required for the upgrade timing out.

Background

Before I go any further, I would like to set the context straight and give you some background so that you can follow this blog along with me easily.

  1. I am part of a platform development team that is responsible for building an on-premise kubernetes cluster
  2. We provide in-place upgrades of everything we develop/provide/extend (e.g. the Linux kernel, the Kubernetes control plane, etc.)
  3. We run a docker registry inside the cluster to enable use cases such as air-gap where the cluster can’t really reach out to an external network to pull artifacts.
  4. We were using a locally managed upstream docker registry for all the CI/CD use cases, and the test pipelines were leveraging the same infrastructure.
  5. We run a DNS Service on the cluster itself that leverages a link-local IP to provide a node-local DNS behavior. So, you will see most of the DNS gets resolved against 169.254.20.10 in this blog post.

The Symptoms

On most of the clusters, the docker pull of the images required for the upgrade was timing out 8 out of 10 times against the new docker registry that we had started using as part of the migrated CI/CD infrastructure.

[Fri Dec 06 05:58:11 UTC] tardis@10.106.176.205 (tardis-master-1) ~
$ docker pull dockerhub.company.com/tardis-docker/some-image:1.5.38
Error response from daemon: Get https://dockerhub.company.com/v2/: dial tcp: lookup dockerhub.company.com on 169.254.20.10:53: read udp 169.254.20.10:49039->169.254.20.10:53: i/o timeout

Early Diagnosis

As any good software engineer would do, I didn’t trust reports and errors shared second-hand. I managed to get my hands on the cluster and did a quick check of a few things to ensure there weren’t any fundamental issues in the setup itself that could be leading to this behavior.

  1. Check if the upstream docker registry was actually reachable. ping/dig were good enough tools for this check (both shown below).
  2. Check if the DNS service on the host was misconfigured in a way that led to problems resolving the name of the upstream docker registry.
# Ping check to ensure basic reachability

[Fri Dec 06 05:27:04 UTC] tardis@10.106.176.205 (tardis-master-1) ~
$ ping dockerhub.company.com -c 3
PING dockerhub.company.com (1.2.3.4) 56(84) bytes of data.
64 bytes from bgl-wwpl-prxy2.company.com (1.2.3.4): icmp_seq=1 ttl=56 time=4.42 ms
64 bytes from bgl-wwpl-prxy2.company.com (1.2.3.4): icmp_seq=2 ttl=56 time=3.20 ms
64 bytes from bgl-wwpl-prxy2.company.com (1.2.3.4): icmp_seq=3 ttl=56 time=2.56 ms

--- dockerhub.company.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 2.564/3.398/4.422/0.770 ms
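
Item 1 above mentions dig as well; for completeness, the equivalent check against the node-local resolver looks something like this. It is a sketch rather than output from the original session, and the answer should match the 1.2.3.4 that ping resolved above.

# Resolve directly against the node-local DNS (sketch; expect the same 1.2.3.4 seen in the ping output)
dig @169.254.20.10 dockerhub.company.com +short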

Validate the CoreDNS configuration to make sure it doesn’t have any invalid settings

# CoreDNS ConfigMap for the corefile

.:53 {
    errors
    health 169.254.20.10:8080
    kubernetes cluster.local 10.24.0.0/16 10.23.0.0/16 {
       pods insecure
       upstream
    }
    forward . 7.8.8.7
    cache 30
    loop
    reload
    loadbalance
}
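
One quick way to double-check that this CoreDNS instance is alive and happy with its configuration is the health endpoint the Corefile above exposes on the link-local IP. This is a sanity check I would add here rather than part of the original notes:

# The health plugin above listens on 169.254.20.10:8080; a 200 means CoreDNS is up and serving
curl -s -o /dev/null -w "%{http_code}\n" http://169.254.20.10:8080/health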

To double-check some of these, I ran a few similar tests on my MacBook as well, and the docker pull for the same some-image:1.5.38 worked properly, albeit a bit slower than usual. I didn’t see the point in spending too much time on it and handed the cluster back to QA, suggesting they retry to see if the issue could be reproduced before we started looking into what might be causing it and whether a fix was needed at all.

It did not take long for QA to come back saying that the issue was surfacing again, and this time an official bug was opened.

Operation docker pull

Notes

  1. Enable the CoreDNS logging plugin to get DNS resolution logs. They are your saving grace.
  2. Understand the CoreDNS log format. Knowing this format will help you sift through the logs easily and find what you are looking for faster (an annotated example follows the format below).
# CoreDNS Log formatting

{remote}:{port} - {>id} "{type} {class} {name} {proto} {size} {>do} {>bufsize}" {rcode} {>rflags} {rsize} {duration}
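
Enabling these logs is just a matter of adding the log plugin to the server block in the Corefile shown earlier. To make the format concrete, here is one of the lines captured later in this post, broken down field by field:

# 169.254.20.10:50269 - 23709 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,tc,rd,ra 839 0.004455923s
#   {remote}:{port}   -> 169.254.20.10:50269   (client that sent the query)
#   {>id}             -> 23709                 (DNS query ID)
#   {type} {class}    -> A IN
#   {name}            -> dockerhub.company.com.
#   {proto} {size}    -> udp, 37-byte request
#   {>do} {>bufsize}  -> false, 512            (no DNSSEC OK bit, 512-byte advertised UDP buffer)
#   {rcode} {>rflags} -> NOERROR, qr,tc,rd,ra  (note tc: the response was truncated)
#   {rsize}           -> 839                   (response size in bytes)
#   {duration}        -> 0.004455923s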

Check #1

Since the docker pull was consistently timing out, I wanted to check if the behavior remains the same for any other ops. So I went ahead and tried a few curl commands against the upstream registry to fetch the available tags. While doing this, I started collecting the logs on the CoreDNS pod to ensure I keep an eye on the state of affairs on the other side of the world.

[Fri Dec 06 05:25:54 UTC] tardis@10.106.176.205 (tardis-master-1) ~
$ curl https://dockerhub.company.com/v2/tardis-docker/tardis-catalog/tags/list
{
  "name" : "tardis-docker/tardis-catalog",
  "tags" : [ "1.5.10", "1.5.11", "1.5.12", "1.5.13", "1.5.14", "1.5.15", "1.5.16", "1.5.17", "1.5.18", "1.5.19", "1.5.2", "1.5.20", "1.5.21", "1.5.3", "1.5.4", "1.5.5", "1.5.6", "1.5.7", "1.5.8", "1.5.9" ],
  "errors" : null
}

CoreDNS Logs

2019-12-06T05:29:20.097Z [INFO] 169.254.20.10:50269 - 23709 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,tc,rd,ra 839 0.004455923s
2019-12-06T05:29:20.099Z [INFO] 169.254.20.10:50269 - 25849 "AAAA IN dockerhub.company.com. udp 37 false 512" NOERROR qr,rd,ra 37 0.005162838s  # ignore this. Not required for IPV4
2019-12-06T05:29:20.105Z [INFO] 169.254.20.10:43158 - 23709 "A IN dockerhub.company.com. tcp 37 false 65535" NOERROR qr,rd,ra 1667 0.007212426s
2019-12-06T05:29:20.105Z [INFO] 169.254.20.10:43158 - 25849 "AAAA IN dockerhub.company.com. tcp 37 false 65535" NOERROR qr,aa,rd,ra 37 0.000104123s # ignore this. Not required for IPV4

Request #1: qr,tc,rd,ra. These flags indicate that the response was truncated (the tc flag) because the advertised receive buffer size (512 bytes) was smaller than the response (839 bytes). The client picks up on the tc flag and retries the same query over TCP instead of UDP, this time with a receive buffer size of 65535 bytes. The full 1667-byte response fits comfortably, so the tc flag is not set on the TCP reply and the client no longer has to worry about the truncated response.
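
If you want to see this truncation dance by hand, dig can mimic both sides of it. This is a reproduction I would run, not part of the original capture: +noedns keeps the classic 512-byte UDP limit and +ignore stops dig from retrying over TCP, so the tc flag stays visible; drop +ignore and dig falls back to TCP and fetches the full answer.

# Hypothetical reproduction, not from the original session
dig @169.254.20.10 dockerhub.company.com A +noedns +ignore   # UDP only, 512-byte limit: expect tc in the flags
dig @169.254.20.10 dockerhub.company.com A +noedns           # same query, but dig retries over TCP on truncation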

Once the client (curl in this case) manages to get its hands on the actual IP behind the name, it establishes a connection to that IP, on port 443 in this case.

root@tardis-master-1:/home/tardis# conntrack -E -p tcp --dport 443 --orig-dst 1.2.3.4
    [NEW] tcp      6 120 SYN_SENT src=10.106.176.205 dst=1.2.3.4 sport=55174 dport=443 [UNREPLIED] src=1.2.3.4 dst=10.106.176.205 sport=443 dport=55174
 [UPDATE] tcp      6 60 SYN_RECV src=10.106.176.205 dst=1.2.3.4 sport=55174 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=55174
 [UPDATE] tcp      6 86400 ESTABLISHED src=10.106.176.205 dst=1.2.3.4 sport=55174 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=55174 [ASSURED]
 [UPDATE] tcp      6 120 FIN_WAIT src=10.106.176.205 dst=1.2.3.4 sport=55174 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=55174 [ASSURED]
 [UPDATE] tcp      6 10 CLOSE src=10.106.176.205 dst=1.2.3.4 sport=55174 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=55174 [ASSURED]
[DESTROY] tcp      6 src=10.106.176.205 dst=1.2.3.4 sport=55174 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=55174 [ASSURED]

This is enough to confirm that the reachability and accessibility of dockerhub.company.com is not what is causing the docker pull to fail.

Check #2

Considering Check #1 did not fail, it was time to look into why the docker pull command itself was having trouble. The only difference this time was that, instead of curling the API, I would be doing a docker pull of one of the available images in the registry.

[Fri Dec 06 05:58:11 UTC] tardis@10.106.176.205 (tardis-master-1) ~
$ docker pull dockerhub.company.com/tardis-docker/some-image:1.5.38
Error response from daemon: Get https://dockerhub.company.com/v2/: dial tcp: lookup dockerhub.company.com on 169.254.20.10:53: read udp 169.254.20.10:49039->169.254.20.10:53: i/o timeout

Let us look at the conntrack entries to see what happened during the docker pull. Unlike the curl case above, nothing shows up here: no connection towards the registry is ever attempted, because the name never resolves.

root@tardis-master-1:/home/tardis# conntrack -E -p tcp --dport 443 --orig-dst 1.2.3.4

CoreDNS logs during the docker pull operations

2019-12-06T06:05:17.457Z [INFO] 169.254.20.10:44526 - 16575 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,tc,rd,ra 771 0.028012678s
2019-12-06T06:05:17.457Z [INFO] 169.254.20.10:56598 - 26043 "AAAA IN dockerhub.company.com. udp 37 false 512" NOERROR qr,rd,ra 37 0.028323846s
2019-12-06T06:05:17.525Z [INFO] 169.254.20.10:43196 - 56518 "A IN dockerhub.company.com. tcp 37 false 65535" NOERROR qr,rd,ra 1667 0.067541454s
2019-12-06T06:05:17.527Z [INFO] 169.254.20.10:35321 - 46276 "AAAA IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,rd,ra 37 0.000340633s
2019-12-06T06:05:17.527Z [INFO] 169.254.20.10:49038 - 11125 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,tc,rd,ra 1667 0.000509809s
2019-12-06T06:05:22.527Z [INFO] 169.254.20.10:42834 - 51654 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,tc,rd,ra 1667 0.000394149s
2019-12-06T06:05:27.579Z [INFO] 169.254.20.10:44374 - 4021 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,tc,rd,ra 1667 0.000393464s
2019-12-06T06:05:27.629Z [INFO] 169.254.20.10:44237 - 31303 "AAAA IN dockerhub.company.com. udp 37 false 512" NOERROR qr,rd,ra 37 0.050828536s
2019-12-06T06:05:32.579Z [INFO] 169.254.20.10:56485 - 55782 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,tc,rd,ra 1667 0.000484152s
2019-12-06T06:05:37.603Z [INFO] 169.254.20.10:55158 - 51922 "AAAA IN dockerhub.company.com. udp 37 false 512" NOERROR qr,rd,ra 37 0.021003385s
2019-12-06T06:05:37.604Z [INFO] 169.254.20.10:45951 - 38036 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,tc,rd,ra 838 0.021930555s
2019-12-06T06:05:37.646Z [INFO] 169.254.20.10:43448 - 15671 "A IN dockerhub.company.com. tcp 37 false 65535" NOERROR qr,rd,ra 1667 0.041151161s
2019-12-06T06:05:37.647Z [INFO] 169.254.20.10:59299 - 31793 "AAAA IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,rd,ra 37 0.000206121s
2019-12-06T06:05:37.647Z [INFO] 169.254.20.10:60743 - 58335 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,tc,rd,ra 1667 0.000393036s
2019-12-06T06:05:42.648Z [INFO] 169.254.20.10:59304 - 2860 "A IN dockerhub.company.com. udp 37 false 512" NOERROR qr,aa,tc,rd,ra 1667 0.000374644s

So that is one too many queries being sent to the DNS server: the same A lookup keeps getting retried roughly every five seconds, and every UDP attempt at it comes back with the tc flag set. What about tcpdump?

root@tardis-master-1:/home/tardis# tcpdump udp and port 53 -i lo -n -t | grep -a1 dockerhub.company.com
IP 169.254.20.10.57175 > 169.254.20.10.53: 21489+ A? dockerhub.company.com.cluster.local. (51)
IP 169.254.20.10.57315 > 169.254.20.10.53: 2120+ AAAA? dockerhub.company.com.cluster.local. (51)
IP 169.254.20.10.53 > 169.254.20.10.57315: 2120 NXDomain*- 0/1/0 (144)
IP 169.254.20.10.53 > 169.254.20.10.57175: 21489 NXDomain*- 0/1/0 (144)
IP 169.254.20.10.35290 > 169.254.20.10.53: 2586+ A? dockerhub.company.com.local. (43)
IP 169.254.20.10.46015 > 169.254.20.10.53: 24921+ AAAA? dockerhub.company.com.local. (43)
IP 169.254.20.10.53 > 169.254.20.10.46015: 24921 NXDomain* 0/1/0 (118)
IP 169.254.20.10.53 > 169.254.20.10.35290: 2586 NXDomain* 0/1/0 (118)
IP 169.254.20.10.45951 > 169.254.20.10.53: 38036+ A? dockerhub.company.com. (37)
IP 169.254.20.10.55158 > 169.254.20.10.53: 51922+ AAAA? dockerhub.company.com. (37)
IP 169.254.20.10.53 > 169.254.20.10.55158: 51922 0/0/0 (37)
IP 169.254.20.10.53 > 169.254.20.10.45951: 38036| 1/11/0 A 1.2.3.4 (511)
IP 169.254.20.10.59299 > 169.254.20.10.53: 31793+ AAAA? dockerhub.company.com. (37)
IP 169.254.20.10.60743 > 169.254.20.10.53: 58335+ A? dockerhub.company.com. (37)
IP 169.254.20.10.53 > 169.254.20.10.59299: 31793* 0/0/0 (37)
IP 169.254.20.10.53 > 169.254.20.10.60743: 58335*| 0/13/0 (581)
IP 169.254.20.10.59304 > 169.254.20.10.53: 2860+ A? dockerhub.company.com. (37)
IP 169.254.20.10.53 > 169.254.20.10.59304: 2860*| 0/13/0 (581)
IP 169.254.20.10.52327 > 169.254.20.10.53: 39346+ A? dockerhub.company.com.default.svc.cluster.local. (63)
IP 169.254.20.10.43270 > 169.254.20.10.53: 2452+ AAAA? dockerhub.company.com.default.svc.cluster.local. (63)
IP 169.254.20.10.53 > 169.254.20.10.43270: 2452 NXDomain*- 0/1/0 (156)
IP 169.254.20.10.53 > 169.254.20.10.52327: 39346 NXDomain*- 0/1/0 (156)
IP 169.254.20.10.60913 > 169.254.20.10.53: 39780+ A? dockerhub.company.com.tardis-system.svc.cluster.local. (69)
IP 169.254.20.10.35227 > 169.254.20.10.53: 8325+ AAAA? dockerhub.company.com.tardis-system.svc.cluster.local. (69)
IP 169.254.20.10.53 > 169.254.20.10.35227: 8325 NXDomain*- 0/1/0 (162)
IP 169.254.20.10.53 > 169.254.20.10.60913: 39780 NXDomain*- 0/1/0 (162)
IP 169.254.20.10.36880 > 169.254.20.10.53: 14892+ A? dockerhub.company.com.svc.cluster.local. (55)
IP 169.254.20.10.59149 > 169.254.20.10.53: 17416+ AAAA? dockerhub.company.com.svc.cluster.local. (55)
IP 169.254.20.10.53 > 169.254.20.10.59149: 17416 NXDomain*- 0/1/0 (148)
IP 169.254.20.10.53 > 169.254.20.10.36880: 14892 NXDomain*- 0/1/0 (148)
IP 169.254.20.10.44019 > 169.254.20.10.53: 21987+ A? dockerhub.company.com.cluster.local. (51)
IP 169.254.20.10.46408 > 169.254.20.10.53: 204+ AAAA? dockerhub.company.com.cluster.local. (51)
IP 169.254.20.10.53 > 169.254.20.10.44019: 21987 NXDomain*- 0/1/0 (144)
IP 169.254.20.10.53 > 169.254.20.10.46408: 204 NXDomain*- 0/1/0 (144)
IP 169.254.20.10.50775 > 169.254.20.10.53: 4835+ A? dockerhub.company.com.local. (43)
IP 169.254.20.10.42858 > 169.254.20.10.53: 63875+ AAAA? dockerhub.company.com.local. (43)
IP 169.254.20.10.53 > 169.254.20.10.42858: 63875 NXDomain* 0/1/0 (118)
root@tardis-master-1:/home/tardis# tcpdump tcp and port 53 -i lo -n -t
--
IP 169.254.20.10.45560 > 169.254.20.10.53: Flags [.], ack 1, win 342, options [nop,nop,TS val 224381706 ecr 224381706], length 0
IP 169.254.20.10.45560 > 169.254.20.10.53: Flags [P.], seq 1:40, ack 1, win 342, options [nop,nop,TS val 224381706 ecr 224381706], length 3912988+ A? dockerhub.company.com. (37)
IP 169.254.20.10.53 > 169.254.20.10.45560: Flags [.], ack 40, win 342, options [nop,nop,TS val 224381706 ecr 224381706], length 0
--
IP 169.254.20.10.45622 > 169.254.20.10.53: Flags [.], ack 1, win 342, options [nop,nop,TS val 224386767 ecr 224386767], length 0
IP 169.254.20.10.45622 > 169.254.20.10.53: Flags [P.], seq 1:40, ack 1, win 342, options [nop,nop,TS val 224386767 ecr 224386767], length 3964183+ A? dockerhub.company.com. (37)
IP 169.254.20.10.53 > 169.254.20.10.45622: Flags [.], ack 40, win 342, options [nop,nop,TS val 224386767 ecr 224386767], length 0
--
IP 169.254.20.10.45872 > 169.254.20.10.53: Flags [.], ack 1, win 342, options [nop,nop,TS val 224406894 ecr 224406894], length 0
IP 169.254.20.10.45872 > 169.254.20.10.53: Flags [P.], seq 1:40, ack 1, win 342, options [nop,nop,TS val 224406895 ecr 224406894], length 3927093+ A? dockerhub.company.com. (37)

Okay. So we can see multiple TCP requests being made to perform the A record query against the DNS server, matching what we noticed in the CoreDNS logs. So what might be causing this? Turns out this is normal behavior for dockerd.

// File: vendor/github.com/docker/libnetwork/resolver.go

resp, err = co.ReadMsg()
// Truncated DNS replies should be sent to the client so that the
// client can retry over TCP
if err != nil && err != dns.ErrTruncated {
   r.forwardQueryEnd()
   logrus.Debugf("[resolver] read from DNS server failed, %s", err)
   continue
}
r.forwardQueryEnd()

So this is what triggers the sequence of A and AAAA queries against the DNS server. The request eventually times out before it can obtain the final IP address needed to establish the actual connection, which is why we see the following error:

169.254.20.10:49039->169.254.20.10:53: i/o timeout

This indicates that the request to the DNS server timed out after a certain interval.

The Remedy

Turns out resolv.conf already has a fix for this, and it is a simple one: the edns0 option, per RFC 2671. EDNS0 lets the client advertise a UDP receive buffer larger than the classic 512 bytes when requesting the A and AAAA records, and it turns out to be super easy to configure.

options edns0

Add the above at the bottom of the /etc/resolv.conf file and that is almost all you need to do. You also need to restart the docker daemon so that the new configuration is honored by docker. (Turns out docker was supporting this behavior all along through its libnetwork utility.)
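
Since /etc/resolv.conf on these nodes is generated by resolvconf (hence the base file below), the rough sequence looks something like this; treat it as a sketch and adapt it to however resolv.conf is managed on your hosts:

# Sketch of the apply steps; paths assume a resolvconf-managed setup like the one shown below
echo "options edns0" >> /etc/resolvconf/resolv.conf.d/base   # persist the option
resolvconf -u                                                # regenerate /etc/resolv.conf
systemctl restart docker                                     # dockerd picks up resolv.conf on startup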

root@tardis-master-1:/home/tardis# cat /etc/resolvconf/resolv.conf.d/base
#---------------------------------------------------------------------
# Modified by tardis: Tue, 03 Sep 2019 09:03:14 UTC
# tardis-config 1716
#---------------------------------------------------------------------
search default.svc.cluster.local tardis-system.svc.cluster.local svc.cluster.local cluster.local local
options edns0

root@tardis-master-1:/home/tardis# cat /etc/resolv.conf
nameserver 169.254.20.10
search default.svc.cluster.local tardis-system.svc.cluster.local svc.cluster.local cluster.local local
options edns0
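
Before rerunning the pull, a quick dig makes for a nice sanity check that an EDNS0 query now gets a complete answer over UDP (again, a check of mine rather than part of the original session): the flags on the answer should no longer include tc.

# Sanity check (not from the original session): with a 1024-byte EDNS0 buffer the A answer fits over UDP
dig @169.254.20.10 dockerhub.company.com A +bufsize=1024 | grep -E "flags:|udp:"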

Check 1

Let’s pull the images again and see how they perform now.

[Fri Dec 06 06:26:09 UTC] tardis@10.106.176.205 (tardis-master-1) ~
$ docker pull dockerhub.company.com/tardis-docker/some-image:1.5.38
1.5.38: Pulling from tardis-docker/some-image
f2233041f557: Already exists
f321bcc6a76c: Already exists
2f25d8d1d058: Already exists
87bfe0d2f0e8: Already exists
145c1bf7947a: Already exists
ea0a6f484557: Already exists
4d03226aa6f1: Already exists
79cf72173686: Already exists
31864de8b88a: Already exists

Good. Now that docker has started pulling the image, let’s quickly check the tcpdump, conntrack and CoreDNS logs to see how they behave differently now.

2019-12-06T06:28:04.119Z [INFO] 169.254.20.10:49622 - 9335 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 37 0.000263135s
2019-12-06T06:28:04.119Z [INFO] 169.254.20.10:49622 - 43254 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000332536s
2019-12-06T06:28:04.272Z [INFO] 169.254.20.10:40189 - 15099 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 37 0.000130516s
2019-12-06T06:28:04.272Z [INFO] 169.254.20.10:40189 - 4122 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000219738s
2019-12-06T06:28:27.429Z [INFO] 169.254.20.10:58124 - 12003 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,rd,ra 1678 0.025104794s
2019-12-06T06:28:27.431Z [INFO] 169.254.20.10:58124 - 55950 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,rd,ra 48 0.02649031s
2019-12-06T06:28:28.812Z [INFO] 169.254.20.10:48714 - 46081 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 37 0.000123496s
2019-12-06T06:28:28.812Z [INFO] 169.254.20.10:48714 - 45597 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000331035s
2019-12-06T06:28:29.201Z [INFO] 169.254.20.10:58678 - 27283 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 37 0.000140919s
2019-12-06T06:28:29.201Z [INFO] 169.254.20.10:58678 - 1256 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000179996s
2019-12-06T06:28:30.101Z [INFO] 169.254.20.10:50055 - 54440 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 37 0.000185493s
2019-12-06T06:28:30.101Z [INFO] 169.254.20.10:50055 - 39023 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000267243s
2019-12-06T06:28:31.120Z [INFO] 169.254.20.10:46638 - 29248 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 37 0.000139373s
2019-12-06T06:28:31.120Z [INFO] 169.254.20.10:46638 - 30294 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000198622s
2019-12-06T06:28:32.007Z [INFO] 169.254.20.10:57405 - 41092 "AAAA IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 37 0.000109586s
2019-12-06T06:28:32.007Z [INFO] 169.254.20.10:57405 - 21216 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000135747s

If you look above, there are no more TCP requests being made for the A or AAAA records. So what changed?

Take this log line, for example:

2019-12-06T06:28:04.119Z [INFO] 169.254.20.10:49622 - 43254 "A IN dockerhub.company.com. udp 48 false 1024" NOERROR qr,aa,rd,ra 1667 0.000332536s

you notice that the receive buffer is now 1024 bytes instead of the previous 512 bytes, and the qr,aa,rd,ra flags on the response show that the tc flag is no longer set. So no more truncation. We also see the Authoritative Answer (aa) flag set, indicating that we have successfully managed to resolve the name to an IP address, and dockerd can now go ahead, set up the required connection and start pulling things.

[UPDATE] tcp      6 120 FIN_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44208 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44208 [ASSURED]
 [UPDATE] tcp      6 3600 CLOSE_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44208 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44208 [ASSURED]
 [UPDATE] tcp      6 30 LAST_ACK src=10.106.176.205 dst=1.2.3.4 sport=44208 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44208 [ASSURED]
 [UPDATE] tcp      6 10 CLOSE src=10.106.176.205 dst=1.2.3.4 sport=44208 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44208 [ASSURED]
    [NEW] tcp      6 120 SYN_SENT src=10.106.176.205 dst=1.2.3.4 sport=44954 dport=443 [UNREPLIED] src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44954
 [UPDATE] tcp      6 60 SYN_RECV src=10.106.176.205 dst=1.2.3.4 sport=44954 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44954
 [UPDATE] tcp      6 86400 ESTABLISHED src=10.106.176.205 dst=1.2.3.4 sport=44954 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44954 [ASSURED]
 [UPDATE] tcp      6 120 FIN_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44678 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44678 [ASSURED]
 [UPDATE] tcp      6 300 CLOSE_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44678 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44678 [ASSURED]
 [UPDATE] tcp      6 30 LAST_ACK src=10.106.176.205 dst=1.2.3.4 sport=44678 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44678 [ASSURED]
    [NEW] tcp      6 120 SYN_SENT src=10.106.176.205 dst=1.2.3.4 sport=44974 dport=443 [UNREPLIED] src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44974
 [UPDATE] tcp      6 10 CLOSE src=10.106.176.205 dst=1.2.3.4 sport=44678 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44678 [ASSURED]
 [UPDATE] tcp      6 60 SYN_RECV src=10.106.176.205 dst=1.2.3.4 sport=44974 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44974
 [UPDATE] tcp      6 86400 ESTABLISHED src=10.106.176.205 dst=1.2.3.4 sport=44974 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44974 [ASSURED]
 [UPDATE] tcp      6 120 FIN_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44954 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44954 [ASSURED]
 [UPDATE] tcp      6 300 CLOSE_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44954 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44954 [ASSURED]
 [UPDATE] tcp      6 30 LAST_ACK src=10.106.176.205 dst=1.2.3.4 sport=44954 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44954 [ASSURED]
    [NEW] tcp      6 120 SYN_SENT src=10.106.176.205 dst=1.2.3.4 sport=44978 dport=443 [UNREPLIED] src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44978
 [UPDATE] tcp      6 10 CLOSE src=10.106.176.205 dst=1.2.3.4 sport=44954 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44954 [ASSURED]
 [UPDATE] tcp      6 60 SYN_RECV src=10.106.176.205 dst=1.2.3.4 sport=44978 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44978
 [UPDATE] tcp      6 86400 ESTABLISHED src=10.106.176.205 dst=1.2.3.4 sport=44978 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44978 [ASSURED]
 [UPDATE] tcp      6 120 FIN_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44978 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44978 [ASSURED]
 [UPDATE] tcp      6 300 CLOSE_WAIT src=10.106.176.205 dst=1.2.3.4 sport=44978 dport=443 src=1.2.3.4 dst=10.106.176.205 sport=443 dport=44978 [ASSURED]

For each layer being pulled, we now see a corresponding sequence of conntrack events being created and torn down.

root@tardis-master-1:/home/tardis# tcpdump tcp and port 53 -i lo -n -t | grep -a1 dockerhub.company.com
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes

No more tcp requests being made to the dns server to resolve dockerhub.company.com

--
IP 169.254.20.10.53 > 169.254.20.10.52960: 47243*- 0/1/0 (151)
IP 169.254.20.10.58124 > 169.254.20.10.53: 12003+ [1au] A? dockerhub.company.com. (48)
IP 169.254.20.10.58124 > 169.254.20.10.53: 55950+ [1au] AAAA? dockerhub.company.com. (48)
IP 169.254.20.10.53 > 169.254.20.10.58124: 12003 1/13/14 A 1.2.3.4 (816)
IP 169.254.20.10.53 > 169.254.20.10.58124: 55950 0/0/1 (48)
IP 169.254.20.10.48714 > 169.254.20.10.53: 45597+ [1au] A? dockerhub.company.com. (48)
IP 169.254.20.10.48714 > 169.254.20.10.53: 46081+ [1au] AAAA? dockerhub.company.com. (48)
IP 169.254.20.10.53 > 169.254.20.10.48714: 46081* 0/0/1 (48)
IP 169.254.20.10.53 > 169.254.20.10.48714: 45597* 1/13/14 A 1.2.3.4 (816)
IP 169.254.20.10.58678 > 169.254.20.10.53: 1256+ [1au] A? dockerhub.company.com. (48)
IP 169.254.20.10.58678 > 169.254.20.10.53: 27283+ [1au] AAAA? dockerhub.company.com. (48)
IP 169.254.20.10.53 > 169.254.20.10.58678: 27283* 0/0/1 (48)
IP 169.254.20.10.53 > 169.254.20.10.58678: 1256* 1/13/14 A 1.2.3.4 (816)
IP 169.254.20.10.50055 > 169.254.20.10.53: 39023+ [1au] A? dockerhub.company.com. (48)
IP 169.254.20.10.50055 > 169.254.20.10.53: 54440+ [1au] AAAA? dockerhub.company.com. (48)
IP 169.254.20.10.53 > 169.254.20.10.50055: 54440* 0/0/1 (48)
IP 169.254.20.10.53 > 169.254.20.10.50055: 39023* 1/13/14 A 1.2.3.4 (816)
IP 169.254.20.10.46638 > 169.254.20.10.53: 30294+ [1au] A? dockerhub.company.com. (48)
IP 169.254.20.10.46638 > 169.254.20.10.53: 29248+ [1au] AAAA? dockerhub.company.com. (48)
IP 169.254.20.10.53 > 169.254.20.10.46638: 29248* 0/0/1 (48)

And no more UDP queries for the A or AAAA records expanded through the search domains either, since the very first query already gets an authoritative answer.