Fragmented packets don't arrive at docker container in k8s cluster

I’ve got quite a scenario here and I’ve been fighting with it for several days full-time, so this will probably be a lot to read. Very high level: GitLab latest version, upgraded during the course of this fight just to be sure. Two Kubernetes clusters, one bare-metal for dev purposes and one at DigitalOcean for prod. I’m just doing the integration and starting to work with autodeployment, and builds of my test project fail on the dev cluster and succeed on the prod cluster. Doing the cluster setup in GitLab went completely painlessly, and installing the GitLab runners and all was easy. (Prometheus doesn’t like the k8s 1.18 that I’m running in dev, but that’s a different topic.)

Dev cluster: ubuntu 18.04 running k8s 1.18, 5 nodes, docker 19.03.6, calico networking
Prod cluster: DigitalOcean hosted, k8s 1.16.6, … and whatever the heck the rest is, DO hides it from me.
Test build: I’m working from the instructions at https://docs.bitnami.com/tutorials/create-ci-cd-pipeline-gitlab-kubernetes/ to get this all set up, and I’m using the tutorial’s test node.js Hello, World! app. I’ve tried upgrading it to node 12; that made no difference.

What I expect to happen: container builds no matter which cluster it picks to do the build
What is happening: container fails to build on dev cluster, succeeds on prod cluster.

The build fires up, grabs all the parts it needs, and then hangs for a while at the “npm install” step until it times out and fails. Making it an “npm install --verbose” showed me that it’s timing out getting to https://registry.npmjs.org/express. That URL works from my browser and from DO, so I know it’s not the site. So I added a “sleep 3600” to the Dockerfile and logged in to the pod. From there I can wget https://registry.npmjs.org/express just fine. Then I discovered that the svc-0 container in the pod has a docker container within it, so I logged in to THAT. There, curl -v https://registry.npmjs.org/express hangs immediately after the TLS 1.3 Client Hello.
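
For the record, the poking around was roughly this (namespace, pod, and container IDs here are placeholders, apart from svc-0, which is the docker:dind service the runner starts):

    # placeholder names; the job pod the runner creates for the build
    kubectl -n gitlab-managed-apps exec -it runner-xxxx-project-1-concurrent-0 -c build -- sh
    wget -O- https://registry.npmjs.org/express      # fine from here

    # the svc-0 (docker:dind) container runs the docker daemon; the image build
    # happens in a container nested inside it
    kubectl -n gitlab-managed-apps exec -it runner-xxxx-project-1-concurrent-0 -c svc-0 -- sh
    docker ps
    docker exec -it <build-container-id> sh
    curl -v https://registry.npmjs.org/express       # hangs right after the Client Hello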

This is where it gets “fun”.

Dropped tcpdump on the node’s outside interface for port 443, on the inside cali interface, on the gitlab-runner svc-0 container, and on the container it’s building all the way in, and ran the curl. I see the TCP connection get set up, the TLS negotiation start, and then it dies. I see the Server Hello coming back in at the host’s external interface, but it doesn’t get forwarded to the cali interface to head in further.
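
The captures were plain tcpdump at each hop, roughly like this (the interface names are placeholders for my node’s NIC and the pod’s cali veth):

    # on the node: the external interface and the pod's cali veth
    tcpdump -ni eno1 -w outside.pcap 'port 443'
    tcpdump -ni cali1a2b3c4d5e6 -w cali.pcap 'port 443'

    # inside the svc-0 container and the nested build container
    tcpdump -ni eth0 -w inside.pcap 'port 443'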

Did the same thing with https://www.google.com and it works fine.

Tested more stuff, and found that registry.npmjs.org is returning the entire certificate chain and Google is not, and it’s doing it as a single 3+ KB packet that is getting fragmented somewhere along the way into 3 pieces. Google is not sending the whole chain, just its cert, and that fits in one normal-sized packet. Tested the curl against another site I know is sending the whole chain… HANG. So for some reason fragmented packets are not being handled correctly, apparently within the Ubuntu network stack or possibly the cali drivers, but only when those packets are supposed to go from the node -> docker container -> docker container. If it’s just one docker layer in, it works fine.
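
An easy way to see the difference between the two servers is to dump what each one presents during the handshake; npmjs hands back the whole chain and it doesn’t fit in one normal-sized packet, Google’s does:

    # compare how much certificate data each server sends back in its handshake
    openssl s_client -connect registry.npmjs.org:443 -servername registry.npmjs.org -showcerts </dev/null
    openssl s_client -connect www.google.com:443 -servername www.google.com -showcerts </dev/null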

I’m sending this here first, as opposed to Ubuntu, Kubernetes, Docker, Alpine (the gitlab-runner), or Debian (the node.js image), because the problem is possibly manifesting in the network layer where the Ubuntu node hands traffic off to the K8S network. I also sent this to GitLab “first” because it only manifests when gitlab-runner is doing a build. If we come up with any sort of credible reason to point at any particular product in the stack, I’ll happily go chase them down, too.

Any ideas? This one has me pretty well stumped.

Realized I’d missed a detail that might be useful. This has been running Calico 3.11.2. The upgrade to 3.14 is running now and I will update with results.

Calico 3.14.0 did not change the behavior.

I don’t think we do anything to deliberately drop fragments. Are you sure that you’re seeing IP fragments and not just separate TCP packets carrying parts of the data?

Is your MTU configured correctly on your main interface? Are Calico’s MTUs configured correctly: https://docs.projectcalico.org/networking/mtu
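
Something like this will show both (assuming the manifest-based install, and with eth0 as a placeholder for your uplink):

    # MTU of the node's main interface
    ip link show dev eth0

    # the veth MTU Calico was configured with (veth_mtu in the stock manifest's ConfigMap)
    kubectl -n kube-system get configmap calico-config -o yaml | grep -i mtu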

Didn’t think this would be intentional; that would break a lot of things really badly. Wireshark is tagging the second packet as “TCP segment of a reassembled PDU”. I’ll figure out how to attach the capture file here shortly.

MTUs at every point within my control are defaults.

Packets 6, 7, and 8. Those never arrive in any form.

Thanks for sending the capture, that’s very useful. Since this is a TCP connection, all the packets have the “don’t fragment” bit set, so IP-level fragmentation doesn’t look like the issue (which is good, because IP fragmentation is horrible). When Wireshark says that it reassembled the TCP segments, it’s because the TLS “Certificate, Server Key Exchange” message was split across three TCP packets, but Wireshark wants to show you the reassembled TLS message. Its UI is confusing when it groups packets like that.

Apart from that, it looks like packet 6 is being dropped, so the client resends its ACK for packet 5 and the server keeps trying to resend packet 6.
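
If you want to confirm that from the capture yourself, a quick tshark pass (filename is a placeholder) shows that DF is set, nothing is IP-fragmented, and packet 6 keeps being retransmitted:

    # DF bit, fragment offset and TCP payload length for every packet
    tshark -r capture.pcap -T fields -e frame.number -e ip.flags.df -e ip.frag_offset -e tcp.len

    # just the retransmissions
    tshark -r capture.pcap -Y 'tcp.analysis.retransmission' -T fields -e frame.number -e tcp.seq -e tcp.len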

What is the MTU of the caliXXX interface for the pod? What is the MTU reported on eth0 inside the pod? The client is signalling an MSS of 1460, which seems too high for our default MTU.
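
Roughly how I’d check those (pod and capture names are placeholders):

    # on the node: the pod's cali veth and its MTU
    ip link show | grep cali

    # inside the pod: MTU of eth0
    kubectl exec -it <pod> -- ip link show eth0

    # the MSS the client advertised in its SYN
    tshark -r capture.pcap -Y 'tcp.flags.syn == 1' -T fields -e frame.number -e tcp.options.mss_val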

Just to confirm, you’re not using the new BPF dataplane?
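
For what it’s worth, you can check with calicoctl if you have it handy; the BPF dataplane is only in use if bpfEnabled is set in the FelixConfiguration:

    # BPF dataplane is only on if this shows bpfEnabled: true
    calicoctl get felixconfiguration default -o yaml | grep -i bpf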

On the node the MTU for the cali* interface is 1440. I’m fairly certain I was not using the BPF dataplane, however…

SO… DigitalOcean uses Cilium, so in the interests of making dev and prod as similar as possible I pulled out Calico and installed Cilium. Again, all defaults. I’m still pretty new to Kubernetes (although I’m in no way new to IT infrastructure; I presume that initial dump up above showed that), so I’m tending to not customize too many things if I don’t have to. The only customization for Calico had been setting the CALICO_IPV4POOL_CIDR value so it wouldn’t conflict with any of the other places we’re using RFC1918 addressing internally.
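
For reference, the swap itself was roughly this; a sketch, assuming the stock Calico manifest I’d applied and the Cilium Helm chart with defaults:

    # remove the Calico install that was applied from the downloaded manifest
    kubectl delete -f calico.yaml

    # install Cilium with its defaults via the Helm chart
    helm repo add cilium https://helm.cilium.io/
    helm install cilium cilium/cilium --namespace kube-system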

So now all of my testing is clean and I’ve just run a build via gitlab-runner that was successful.

I tried reproducing this locally, but it works for me. There were some differences I noticed in the trace:

  • Initially, I saw a 3000 byte packet even though my host’s MTU was 1500. This was due to the kernel GRO combining the packets. The request was successful nonetheless. With GRO enabled, the veth’s MTU is essentially ignored.
  • After disabling GRO on my upstream interface (with ethtool, as sketched after this list), my trace looked more like yours.
  • However, the pod’s veth MTU was 1440 both inside and outside the pod whereas the MSS in your trace makes me think your pod’s veth MTU was 1500.
  • When I curled the service you mentioned, I saw the pod signal an MSS consistent with 1440 MTU and the response packets were of the expected size. Again, even with GRO disabled, the response got through.
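
In case you want to try the same comparison, GRO can be checked and toggled with ethtool (eth0 here is a placeholder for the upstream interface):

    # see whether generic receive offload is on, then turn it off for the test
    ethtool -k eth0 | grep generic-receive-offload
    ethtool -K eth0 gro off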

I’m wondering if there’s something odd about your set-up. Were you using Calico CNI? Were you using CNI chaining or multus, for example, which may have changed the veth set-up?

Here’s the yaml: https://drive.google.com/open?id=1hBbZqT3GGF8p_eqMyZ4uaOiFQM7L6Kaz

I literally changed one line from the version I grabbed from the Calico download, and that’s the CIDR I referred to. I do see the 1440 MTU is set in there, so that matches what you had determined. I had originally installed 3.11.2 and got the same behavior, and had made that same change to the downloaded file.

What about your base OS, any particular config you apply there? Is it just vanilla Ubuntu 18.04?

Nothing network-related is getting tweaked. It’s Puppet-controlled Ubuntu 18.04, but looking at what all we have Puppet doing, it’s setting the root password, sshd configs, LDAP auth, NTP, Zabbix monitoring, and installing Docker.