Dear Calico community,
based on the following environment:
- K8s: 1.19.3
- Calico: 3.16.4
deployed via:
- Kubespray: 2.11
we use the following sample K8s YAML, which provides:
- 2 K8s pods
- each running a single container from the “network-multitool” image (which offers several network tools such as “nc”, “ping”, “traceroute”, etc.)
- scheduled on DIFFERENT physical nodes
- with a service attached to “mt-server” so that “mt-client” can reach it:
apiVersion: v1
kind: Namespace
metadata:
  name: clusterdbg
---
# Pod 1 - Role: server
apiVersion: v1
kind: Pod
metadata:
  name: mt-server
  namespace: clusterdbg
  labels:
    app: mt-server
spec:
  # Pod is scheduled on nodes that are labelled with dbgnode=n1
  nodeSelector:
    dbgnode: n1
  tolerations:
  - key: "node.kubernetes.io/unschedulable"
    operator: "Equal"
    effect: "NoSchedule"
  containers:
  - command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;"]
    image: registry-prod.xxxxxx.corpintra.xxx/clusterdbg/praqma/network-multitool:0daefe6
    imagePullPolicy: IfNotPresent
    name: mt-server
    ports:
    # TCP-Port 9090
    - containerPort: 9090
    # UDP-Port 9191
    - containerPort: 9091
---
# Pod 2 - Role: client
apiVersion: v1
kind: Pod
metadata:
  name: mt-client
  namespace: clusterdbg
spec:
  # Pod is scheduled on nodes that are labelled with dbgnode=n2
  nodeSelector:
    dbgnode: n2
  tolerations:
  - key: "node.kubernetes.io/unschedulable"
    operator: "Equal"
    effect: "NoSchedule"
  containers:
  - command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;"]
    image: registry-prod.xxxxxx.corpintra.xxx/clusterdbg/praqma/network-multitool:0daefe6
    imagePullPolicy: IfNotPresent
    name: mt-client
---
# Cluster IP service that is exposing the mt-server
apiVersion: v1
kind: Service
metadata:
  name: mt-server-clusterip
  namespace: clusterdbg
spec:
  selector:
    app: mt-server
  ports:
  - protocol: TCP
    port: 9090
    targetPort: 9090
    name: tcp-service
  - protocol: UDP
    port: 9191
    targetPort: 9191
    name: udp-service
Pinging “mt-server” from “mt-client” succeeds. However, as soon as we try to reach the server over TCP, e.g. via “netcat”, the connection fails: no “Hello” reaches the netcat process listening on port 9090 on “mt-server”:
# Start the netcat TCP listener on "mt-server" on port 9090
nc -l -p 9090
# Write to the netcat listener from "mt-client" on port 9090
echo "Hello" | nc <target_ip> 9090
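For anyone who wants to reproduce the TCP test with a clearer failure signal than netcat gives, here is a minimal sketch in Python. It is not part of our original setup: the function name `tcp_probe` is hypothetical, and it assumes a Python 3 interpreter is available wherever the client side runs. The idea is to separate "connection refused" (the SYN arrived but nothing is listening) from "timeout" (the SYN or its reply was most likely dropped somewhere on the path).

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        # create_connection performs the full TCP handshake.
        with socket.create_connection((host, port), timeout=timeout) as conn:
            # Same payload as the netcat test above.
            conn.sendall(b"Hello\n")
        return "open"
    except ConnectionRefusedError:
        # An RST came back: the host was reached, but no listener on the port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout: packets were probably dropped.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. "no route to host".
        return f"error: {exc}"
```

Calling `tcp_probe("<target_ip>", 9090)` from the client side, once with the pod IP of “mt-server” and once with the ClusterIP of the service, would show whether both paths fail in the same way.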
After analyzing TCP SYN flags on different network devices (container device, tunnel device, host device) we get the following picture:
Hints:
- Inter-host communication used to work; it suddenly started failing and we do not know why
- Host-local communication works (e.g. when both PODs are running on the same physical node)
- We followed the Calico troubleshooting guide (https://docs.projectcalico.org/archive/v3.16/maintenance/troubleshoot/troubleshooting) without detecting any obvious errors
Does the Calico community have any ideas what we can do to get rid of this issue?