Thanks in advance to anyone who can assist. I’ve been learning quite a bit about Calico over the past couple of weeks, so please excuse a new user and community member. I am running into an issue with Calico, which has been running on a Kubernetes cluster for a long time with no issues that I am aware of. The issue is that the calico-kube-controllers deployment does not start. Enabling debug logging didn’t provide much additional information; here are the logs I was able to collect.
2021-02-22T18:38:45.643542239Z 2021-02-22 18:38:45.642 [INFO][1] main.go 75: Loaded configuration from environment config=&config.Config{LogLevel:"debug", ReconcilerPeriod:"5m", CompactionPeriod:"10m", EnabledControllers:"policy,namespace,serviceaccount,workloadendpoint,node", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", HealthEnabled:true}
2021-02-22T18:38:45.643593132Z 2021-02-22 18:38:45.643 [DEBUG][1] load.go 70: Loading config from environment
2021-02-22T18:38:45.643598551Z 2021-02-22 18:38:45.643 [DEBUG][1] client.go 30: Using datastore type 'etcdv3'
2021-02-22T18:38:55.643928454Z 2021-02-22 18:38:55.643 [FATAL][1] main.go 87: Failed to start error=failed to build Calico client: context deadline exceeded
It looks like your Calico installation is using the etcd datastore driver and it is failing to connect to etcd.
Was this always an etcd-based cluster? (If not, you may have accidentally installed an etcd driver manifest instead of a kubernetes one).
Is the etcd configuration correct? I believe it lives in the calico-config config map in the kube-system namespace. Perhaps you’re running on a fabric with dynamic IPs and etcd was restarted with a new IP.
Is etcd running and healthy? Have you upgraded or changed your etcd in some way recently?
Are you using Calico host endpoints (host protection policy)? If so, it’s possible to accidentally cut the connection to the datastore.
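As a rough way to work through the connectivity questions above, here is a minimal Python sketch that checks whether each etcd endpoint accepts a plain TCP connection. It assumes the endpoints are given as a comma-separated list of URLs with a scheme (for example `https://host:2379`); the function names `parse_endpoints` and `tcp_reachable` are made up for this illustration. It only tests reachability, so a successful connection does not rule out a TLS or certificate problem.

```python
import socket
from urllib.parse import urlsplit


def parse_endpoints(etcd_endpoints):
    """Split a comma-separated endpoint string into (host, port) pairs.

    Assumes each entry is a URL with a scheme, e.g. "https://10.0.0.11:2379".
    """
    pairs = []
    for raw in etcd_endpoints.split(","):
        raw = raw.strip()
        if not raw:
            continue
        parts = urlsplit(raw)
        # 2379 is etcd's default client port; used only when no port is given.
        pairs.append((parts.hostname, parts.port or 2379))
    return pairs


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a plain TCP connection succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `tcp_reachable(host, port)` for each pair returned by `parse_endpoints(...)` from inside the affected pod's network gives a quick yes/no per endpoint. If every endpoint is reachable but the client still times out, the TLS layer (CA, client certificate, key) is the next thing to look at.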
Thanks @fasaxc. Yes, it was always etcd based, and I have no reason to believe the etcd configuration is incorrect. We’re validating your other questions and points now.
While digging through logs, I did find that etcd had a hiccup the other day, which is when this started happening. The etcd nodes seemed to lose their interconnectivity, and after a minute or so they re-established themselves. They all seem happy now and appear to be communicating fine.
calico-node is also up on all of the worker nodes and seems to be running fine. The only issue is calico-kube-controllers; I still haven’t figured out what is stopping it from communicating with etcd. Still digging.
@fasaxc - Sorry for the delay in getting back to you. I appreciate your previous assistance; it was indeed a communication issue between Calico and etcd, and your words led me down the right road. A client certificate for Calico, signed by the etcd CA, had expired, which was causing the issue. Certificates were renewed, old nodes were culled, and everything is happy once again.
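For anyone who lands here with the same symptom, a minimal sketch of the expiry check that would have caught this earlier. It assumes you have already extracted the certificate's end date with `openssl x509 -enddate -noout -in <cert>`, which prints a line such as `notAfter=Feb 22 18:38:45 2021 GMT`; the helper names below are made up for this illustration, and only the Python standard library is used.

```python
import ssl
import time


def seconds_until_expiry(openssl_enddate, now=None):
    """Return seconds until the certificate expires (negative if expired).

    `openssl_enddate` is the output of `openssl x509 -enddate -noout`,
    with or without the leading "notAfter=" prefix.
    """
    not_after = openssl_enddate.strip()
    prefix = "notAfter="
    if not_after.startswith(prefix):
        not_after = not_after[len(prefix):]
    # Parses the "Mon DD HH:MM:SS YYYY GMT" form into seconds since the epoch.
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return expires_at - now


def is_expired(openssl_enddate, now=None):
    """True if the certificate's notAfter time has been reached or passed."""
    return seconds_until_expiry(openssl_enddate, now) <= 0
```

Running something like this periodically against the client certificate that Calico uses for etcd, and alerting when the remaining time drops below a few weeks, would flag the problem before the controllers start timing out.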