I'd like to discuss what failures can occur with Calico and what outages they produce.
In a network with heavy packet loss, what would happen to Calico's runtime configuration? Will a component go crazy, delete routes, or something like that?
If, at some point, you shut down all Calico components, what would happen to the Kubernetes cluster?
I'm trying to understand what additional risks are introduced by adding the complexity of Calico.
The thing to remember is that Calico is NOT on the data path.
The calico-node pod simply programs routes and iptables rules (or BPF programs) into the kernel. Obviously the kernel is then a possible point of failure, but if your kernel is having issues passing traffic, your node is toast anyway.
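If you want to see what has actually been programmed on a node, the kernel routing table is the place to look. Below is a minimal sketch that filters `ip route show` output by its `proto` field. It assumes a BGP setup where BIRD installs the routes and they are tagged `proto bird`; other modes may tag routes differently. The function names and the sample table are made up for illustration and are not taken from a real cluster.

```python
import subprocess


def routes_by_proto(ip_route_output: str, proto: str = "bird") -> list[str]:
    """Return the route lines whose 'proto' field matches the given protocol."""
    matches = []
    for line in ip_route_output.splitlines():
        fields = line.split()
        # `ip route` prints the route's origin as a "proto <name>" token pair.
        for key, value in zip(fields, fields[1:]):
            if key == "proto" and value == proto:
                matches.append(line.strip())
                break
    return matches


def live_routes() -> str:
    """Read the routing table of the node this runs on (needs iproute2)."""
    result = subprocess.run(
        ["ip", "route", "show"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Illustrative sample only; addresses and interface names are invented.
SAMPLE = """\
default via 10.0.0.1 dev eth0 proto dhcp metric 100
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.5
192.168.12.0/26 via 10.0.0.6 dev tunl0 proto bird onlink
blackhole 192.168.33.0/26 proto bird
192.168.33.7 dev cali1a2b3c4d5e6 scope link
"""

if __name__ == "__main__":
    # Swap SAMPLE for live_routes() to inspect a real node.
    for route in routes_by_proto(SAMPLE):
        print(route)
```

Running the same check before and after stopping calico-node on a test node is a simple way to confirm for yourself whether the routes stay in place.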
If all the calico-node pods are deleted, new pods will not be networked and policy changes will not be applied. Everything else will continue as normal, because the routes and rules already programmed into the kernel stay in place.
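To make the split between "programs the kernel" and "forwards traffic" concrete, here is a toy model. It is not Calico code: the `Kernel` and `Agent` classes, the addresses, and the interface names are all hypothetical stand-ins, and the real system is far more involved. It only illustrates the behaviour described above.

```python
class Kernel:
    """Stands in for the node's kernel: it holds routes and forwards packets."""

    def __init__(self):
        self.routes = {}  # pod IP -> interface name

    def forward(self, dst_ip):
        # Forwarding consults only kernel state, never the agent.
        return self.routes.get(dst_ip)


class Agent:
    """Stands in for calico-node: it only writes state into the kernel."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.running = True

    def add_pod(self, ip, iface):
        if not self.running:
            return False  # nothing is running to program the route
        self.kernel.routes[ip] = iface
        return True


kernel = Kernel()
agent = Agent(kernel)
agent.add_pod("192.168.33.7", "cali-a")

agent.running = False  # models all calico-node pods being deleted

existing = kernel.forward("192.168.33.7")        # existing pod is still routed
added = agent.add_pod("192.168.33.8", "cali-b")  # new pod cannot be networked
new = kernel.forward("192.168.33.8")             # so there is no route for it

print(existing, added, new)
```

The point of the model is that `forward` never touches the agent, so stopping the agent changes nothing for state that is already in place.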
Calico typically gets its config from the Kubernetes API server. If packet loss is bad enough that calico-node has trouble talking to the API server, calico-node will be restarted to try to recover. We’re not aware of any scenarios where it “goes crazy” or deletes routes. Avoiding that kind of failure was a consideration in the original design.