Dear Calico community,
based on the following environment:
- K8s: 1.19.3
- Calico: 3.16.4
deployed via:
- Kubespray: 2.11
we use the following sample K8s-YAML providing:
- 2x K8s-PODs
- comprising a single container image “network-multitool” (offering several network tools like “nc”, “ping”, “traceroute”, etc.) each
- on DIFFERENT physical nodes
- with a service attached to “mt-server” to allow the “mt-client” to reach it:
apiVersion: v1
kind: Namespace
metadata:
  name: clusterdbg
---
# Pod 1 - Role: server
apiVersion: v1
kind: Pod
metadata:
  name: mt-server
  namespace: clusterdbg
  labels:
    app: mt-server
spec:
  # Pod is scheduled on nodes that are labelled with dbgnode=n1
  nodeSelector:
    dbgnode: n1
  tolerations:
  - key: "node.kubernetes.io/unschedulable"
    operator: "Equal"
    effect: "NoSchedule"
  containers:
  - command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;"]
    image: registry-prod.xxxxxx.corpintra.xxx/clusterdbg/praqma/network-multitool:0daefe6
    imagePullPolicy: IfNotPresent
    name: mt-server
    ports:
    # TCP-Port 9090
    - containerPort: 9090
    # UDP-Port 9191
    - containerPort: 9091
---
# Pod 2 - Role: client
apiVersion: v1
kind: Pod
metadata:
  name: mt-client
  namespace: clusterdbg
spec:
  # Pod is scheduled on nodes that are labelled with dbgnode=n2
  nodeSelector:
    dbgnode: n2
  tolerations:
  - key: "node.kubernetes.io/unschedulable"
    operator: "Equal"
    effect: "NoSchedule"
  containers:
  - command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;"]
    image: registry-prod.xxxxxx.corpintra.xxx/clusterdbg/praqma/network-multitool:0daefe6
    imagePullPolicy: IfNotPresent
    name: mt-client
---
# Cluster IP service that is exposing the mt-server
apiVersion: v1
kind: Service
metadata:
  name: mt-server-clusterip
  namespace: clusterdbg
spec:
  selector:
    app: mt-server
  ports:
  - protocol: TCP
    port: 9090
    targetPort: 9090
    name: tcp-service
  - protocol: UDP
    port: 9191
    targetPort: 9191
    name: udp-service
Pinging “mt-server” from “mt-client” succeeds. As soon as we try to reach the server over TCP, e.g. with “netcat”, it fails: no “Hello” reaches the netcat listener running on “mt-server” on port 9090.
# Start the netcat TCP listener on "mt-server" on port 9090
nc -l -p 9090
# Write to the netcat listener from "mt-client" on port 9090
echo "Hello" | nc <target_ip> 9090
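To make the TCP test repeatable with an explicit timeout and exit status, here is a minimal sketch using bash's built-in /dev/tcp redirection. It assumes bash and the coreutils "timeout" command are available in the container; the function name tcp_check is made up for illustration.

```shell
#!/usr/bin/env bash
# tcp_check HOST PORT [TIMEOUT_SECONDS]   (hypothetical helper)
# Prints "open" and returns 0 if a TCP connection can be established,
# prints "closed-or-filtered" and returns 1 otherwise (refused or timed out).
tcp_check() {
  local host="$1" port="$2" wait="${3:-3}"
  # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT
  if timeout "$wait" bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed-or-filtered"
    return 1
  fi
}
```

Running "tcp_check <target_ip> 9090" from “mt-client” once against the pod IP of “mt-server” and once against the ClusterIP of the service may help to tell whether the service layer is involved at all.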
After analyzing TCP SYN flags on different network devices (container device, tunnel device, host device) we get the following picture:
Hints:
- Inter-host communication used to work; it suddenly started failing and we do not know why
- Host-local communication works (e.g. when both PODs run on the same physical node)
- We followed the Calico troubleshooting guide (“Troubleshoot and diagnostics”) without detecting any obvious errors
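For reference, a sketch of how SYN captures like the ones mentioned above can be taken per device. The interface names (cali1234567890a, tunl0, eth0) are placeholders, and IPIP encapsulation is an assumption; with a different encapsulation the filter on the host device has to be adapted. The helper functions only print the tcpdump commands, they do not run them.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that build tcpdump commands for SYN tracing.
PORT=9090

syn_filter() {
  # Matches TCP segments with the SYN flag set, to or from the given port.
  echo "tcp port $1 and tcp[tcpflags] & tcp-syn != 0"
}

capture_cmd() {
  # Prints (does not execute) a tcpdump command for interface $1 and filter $2.
  echo "tcpdump -ni $1 '$2'"
}

# Container-side veth on the host and the tunnel device see the inner packet.
capture_cmd cali1234567890a "$(syn_filter "$PORT")"
capture_cmd tunl0 "$(syn_filter "$PORT")"
# On the host uplink the pod traffic is encapsulated (assumed: IPIP, which is
# IP protocol 4), so a plain "tcp port" filter would not match it there.
capture_cmd eth0 "ip proto 4"
```

Comparing on which device of which node the SYN is last seen narrows down where the packet is dropped.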
Does the Calico community have any ideas about what we can do to resolve this issue?