Fragmented packets don't arrive at docker container in k8s cluster

I’ve got quite a scenario here and I’ve been fighting with it for several days full-time, so this will probably be a lot to read. Very high level: GitLab latest version, upgraded during the course of this fight just to be sure. Two Kubernetes clusters, one bare-metal for dev purposes and one at DigitalOcean for prod. I’m just doing the integration and starting to work with autodeployment, and builds of my test project fail on the dev cluster and succeed on the prod cluster. Doing the cluster setup in GitLab went completely painlessly, and installing the GitLab runners and all was easy. (Prometheus doesn’t like the k8s 1.18 that I’m running in dev, but that’s a different topic.)

Dev cluster: ubuntu 18.04 running k8s 1.18, 5 nodes, docker 19.03.6, calico networking
Prod cluster: DigitalOcean hosted, k8s 1.16.6, … and whatever the heck the rest is, DO hides it from me.
Test build: I’m working from the instructions at https://docs.bitnami.com/tutorials/create-ci-cd-pipeline-gitlab-kubernetes/ to get this all set up, and I’m using the tutorial’s test node.js Hello, World! app. I’ve tried upgrading it to node 12; that made no difference.

What I expect to happen: container builds no matter which cluster it picks to do the build
What is happening: container fails to build on dev cluster, succeeds on prod cluster.

The build fires up, grabs all the parts it needs, and then hangs for a while at the “npm install” step until it times out and fails. Making it an “npm install --verbose” showed me that it’s timing out getting to https://registry.npmjs.org/express. That URL works from my browser and from DO, so I know it’s not the site. So I added a “sleep 3600” to the Dockerfile and logged in to the pod. From there I can wget https://registry.npmjs.org/express just fine. Then I discovered that the svc-0 container in the pod has a docker container within it, so I logged in to THAT. There, curl -v https://registry.npmjs.org/express hangs immediately after the TLS 1.3 Client Hello.
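
For the record, the poking around was roughly this (namespace, pod, and container IDs here are placeholders, apart from svc-0, which is the docker:dind service the runner starts):

    # placeholder names; the job pod the runner creates for the build
    kubectl -n gitlab-managed-apps exec -it runner-xxxx-project-1-concurrent-0 -c build -- sh
    wget -O- https://registry.npmjs.org/express      # fine from here

    # the svc-0 (docker:dind) container runs the docker daemon; the image build
    # happens in a container nested inside it
    kubectl -n gitlab-managed-apps exec -it runner-xxxx-project-1-concurrent-0 -c svc-0 -- sh
    docker ps
    docker exec -it <build-container-id> sh
    curl -v https://registry.npmjs.org/express       # hangs right after the Client Hello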

This is where it gets “fun”.

Dropped tcpdump on the node’s outside interface for port 443, on the inside cali interface, on the gitlab-runner svc-0 container, and on the container it’s building all the way in, and ran the curl. I see the TCP connection get set up, the TLS negotiation start, and then it dies. I see the Server Hello coming back in at the host’s external interface, but it doesn’t get forwarded to the cali interface to head in further.
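
The captures were plain tcpdump at each hop, roughly like this (the interface names are placeholders for my node’s NIC and the pod’s cali veth):

    # on the node: the external interface and the pod's cali veth
    tcpdump -ni eno1 -w outside.pcap 'port 443'
    tcpdump -ni cali1a2b3c4d5e6 -w cali.pcap 'port 443'

    # inside the svc-0 container and the nested build container
    tcpdump -ni eth0 -w inside.pcap 'port 443'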

Did the same thing with https://www.google.com and it works fine.

Tested more stuff, and found that registry.npmjs.org is returning the entire certificate chain and Google is not, and it’s doing it as a single 3+ KB packet that is getting fragmented somewhere along the way into 3 pieces. Google is not sending the whole chain, just its cert, and that fits in one normal-sized packet. Tested the curl against another site I know is sending the whole chain… HANG. So for some reason fragmented packets are not being handled correctly, apparently within the Ubuntu network stack or possibly the cali drivers, but only when those packets are supposed to go from the node -> docker container -> docker container. If it’s just one docker layer in, it works fine.
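
An easy way to see the difference between the two servers is to dump what each one presents during the handshake; npmjs hands back the whole chain and it doesn’t fit in one normal-sized packet, Google’s does:

    # compare how much certificate data each server sends back in its handshake
    openssl s_client -connect registry.npmjs.org:443 -servername registry.npmjs.org -showcerts </dev/null
    openssl s_client -connect www.google.com:443 -servername www.google.com -showcerts </dev/null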

I’m sending this here first, as opposed to Ubuntu, Kubernetes, Docker, Alpine (the gitlab-runner), or Debian (the node.js image), because the problem is possibly manifesting in the network layer where the Ubuntu node hands traffic off to the K8S network. I also sent this to GitLab “first” because it only manifests when gitlab-runner is doing a build. If we come up with any sort of credible reason to point at any particular product in the stack, I’ll happily go chase them down, too.

Any ideas? This one has me pretty well stumped.

Realized I’d missed a detail that might be useful. This has been running Calico 3.11.2. The upgrade to 3.14 is running now and I will update with results.

Calico 3.14.0 did not change the behavior.

I don’t think we do anything to deliberately drop fragments. Are you sure that you’re seeing IP fragments and not just separate TCP packets carrying parts of the data?

Is your MTU configured correctly on your main interface? Are Calico’s MTUs configured correctly: https://docs.projectcalico.org/networking/mtu
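
Something like this will show both (assuming the manifest-based install, and with eth0 as a placeholder for your uplink):

    # MTU of the node's main interface
    ip link show dev eth0

    # the veth MTU Calico was configured with (veth_mtu in the stock manifest's ConfigMap)
    kubectl -n kube-system get configmap calico-config -o yaml | grep -i mtu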

Didn’t think this would be intentional; that would break a lot of things really badly. Wireshark is tagging the second packet as “TCP segment of a reassembled PDU”. I’ll figure out how to attach the capture file here shortly.

MTUs at every point within my control are defaults.

Packets 6, 7, and 8. Those never arrive in any form.

Thanks for sending the capture, that’s very useful. Since this is a TCP connection, all the packets have the “don’t fragment” bit set, so IP-level fragmentation doesn’t look like the issue (which is good, because IP fragmentation is horrible). When Wireshark says that it reassembled the TCP segments, it’s because the TLS “Certificate, Server Key Exchange” message was split across three TCP packets, but Wireshark wants to show you the reassembled TLS message. Its UI is confusing when it groups packets like that.

Apart from that, it looks like packet 6 is being dropped, so the client resends its ACK for packet 5 and the server keeps trying to resend packet 6.
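
If you want to confirm that from the capture yourself, a quick tshark pass (filename is a placeholder) shows that DF is set, nothing is IP-fragmented, and packet 6 keeps being retransmitted:

    # DF bit, fragment offset and TCP payload length for every packet
    tshark -r capture.pcap -T fields -e frame.number -e ip.flags.df -e ip.frag_offset -e tcp.len

    # just the retransmissions
    tshark -r capture.pcap -Y 'tcp.analysis.retransmission' -T fields -e frame.number -e tcp.seq -e tcp.len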

What is the MTU of the caliXXX interface for the pod? What is the MTU reported on eth0 inside the pod? The client is signalling an MSS of 1460, which seems too high for our default MTU.
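
Roughly how I’d check those (pod and capture names are placeholders):

    # on the node: the pod's cali veth and its MTU
    ip link show | grep cali

    # inside the pod: MTU of eth0
    kubectl exec -it <pod> -- ip link show eth0

    # the MSS the client advertised in its SYN
    tshark -r capture.pcap -Y 'tcp.flags.syn == 1' -T fields -e frame.number -e tcp.options.mss_val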

Just to confirm, you’re not using the new BPF dataplane?
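
For what it’s worth, you can check with calicoctl if you have it handy; the BPF dataplane is only in use if bpfEnabled is set in the FelixConfiguration:

    # BPF dataplane is only on if this shows bpfEnabled: true
    calicoctl get felixconfiguration default -o yaml | grep -i bpf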

On the node the MTU for the cali* interface is 1440. I’m fairly certain I was not using the BPF dataplane, however…

SO… DigitalOcean uses Cilium, so in the interests of making dev and prod as similar as possible I pulled out Calico and installed Cilium. Again, all defaults. I’m still pretty new to Kubernetes (although I’m in no way new to IT infrastructure; I presume that initial dump up above showed that), so I’m tending to not customize too many things if I don’t have to. The only customization for Calico had been setting the CALICO_IPV4POOL_CIDR value so it wouldn’t conflict with any of the other places we’re using RFC1918 addressing internally.
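
For reference, the swap itself was roughly this; a sketch, assuming the stock Calico manifest I’d applied and the Cilium Helm chart with defaults:

    # remove the Calico install that was applied from the downloaded manifest
    kubectl delete -f calico.yaml

    # install Cilium with its defaults via the Helm chart
    helm repo add cilium https://helm.cilium.io/
    helm install cilium cilium/cilium --namespace kube-system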

So now all of my testing is clean and I’ve just run a build via gitlab-runner that was successful.

I tried reproducing this locally, but it works for me. There were some differences I noticed in the trace:

  • Initially, I saw a 3000 byte packet even though my host’s MTU was 1500. This was due to the kernel GRO combining the packets. The request was successful nonetheless. With GRO enabled, the veth’s MTU is essentially ignored.
  • After disabling GRO on my upstream interface (with ethtool, as sketched after this list), my trace looked more like yours.
  • However, the pod’s veth MTU was 1440 both inside and outside the pod whereas the MSS in your trace makes me think your pod’s veth MTU was 1500.
  • When I curled the service you mentioned, I saw the pod signal an MSS consistent with 1440 MTU and the response packets were of the expected size. Again, even with GRO disabled, the response got through.
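
In case you want to try the same comparison, GRO can be checked and toggled with ethtool (eth0 here is a placeholder for the upstream interface):

    # see whether generic receive offload is on, then turn it off for the test
    ethtool -k eth0 | grep generic-receive-offload
    ethtool -K eth0 gro off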

I’m wondering if there’s something odd about your set-up. Were you using Calico CNI? Were you using CNI chaining or multus, for example, which may have changed the veth set-up?

Here’s the yaml: https://drive.google.com/open?id=1hBbZqT3GGF8p_eqMyZ4uaOiFQM7L6Kaz

I literally changed one line from the version I grabbed from the Calico download, and that’s the CIDR I referred to. I do see the 1440 MTU is set in there, so that matches what you had determined. I had originally installed 3.11.2 and got the same behavior, and had made that same change to the downloaded file.

What about your base OS, any particular config you apply there? Is it just vanilla Ubuntu 18.04?

Nothing network-related is getting tweaked. It’s Puppet-controlled Ubuntu 18.04, but looking at what all we have Puppet doing, it’s setting the root password, sshd configs, LDAP auth, NTP, Zabbix monitoring, and installing Docker.