Container failed to start

Thanks in advance to anyone who can assist. I’ve been learning quite a bit about Calico over the past couple of weeks, so please excuse a new user and community member. I am running into an issue with Calico, which has been running on a Kubernetes cluster for a long time with no issues that I am aware of. The calico-kube-controllers deployment does not start. Enabling debug logging didn’t provide any additional information. Here are some logs I was able to collect.

2021-02-22T18:38:45.643542239Z 2021-02-22 18:38:45.642 [INFO][1] main.go 75: Loaded configuration from environment config=&config.Config{LogLevel:"debug", ReconcilerPeriod:"5m", CompactionPeriod:"10m", EnabledControllers:"policy,namespace,serviceaccount,workloadendpoint,node", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", HealthEnabled:true}
2021-02-22T18:38:45.643593132Z 2021-02-22 18:38:45.643 [DEBUG][1] load.go 70: Loading config from environment
2021-02-22T18:38:45.643598551Z 2021-02-22 18:38:45.643 [DEBUG][1] client.go 30: Using datastore type 'etcdv3'
2021-02-22T18:38:55.643928454Z 2021-02-22 18:38:55.643 [FATAL][1] main.go 87: Failed to start error=failed to build Calico client: context deadline exceeded

Can anyone point me in a direction where I can obtain more information about what might be failing?

It looks like your Calico installation is using the etcd datastore driver and it is failing to connect to etcd.

  • Was this always an etcd-based cluster? (If not, you may have accidentally installed an etcd driver manifest instead of a Kubernetes one.)
  • Is the etcd configuration correct? I believe this is in the calico-config config map in the kube-system namespace. Perhaps you’re running on a fabric with dynamic IPs and etcd was restarted with a new IP.
  • Is etcd running and healthy? Have you upgraded or changed your etcd in some way recently?
  • Are you using Calico host endpoints (host protection policy)? If so, it’s possible to accidentally cut the connection to the datastore.
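
A quick TCP probe can help answer the "is etcd reachable?" questions above. The following is a minimal Python sketch, not part of Calico. The function name `can_connect` is hypothetical, and port 2379 is assumed only because it is the conventional etcd client port. It confirms that a TCP connection opens; it does not validate TLS or client certificates.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, `can_connect("10.0.0.10", 2379)` would probe a placeholder etcd address; substitute an endpoint from your own etcd configuration. A True result with a client that still times out points toward TLS or authentication rather than basic networking.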

Thanks @fasaxc. Yes, it was always etcd-based, and I have no reason to believe the etcd configuration is incorrect. We’re checking your other questions and points now.

Digging through the logs, I did find that etcd had a hiccup the other day, which is when this started happening. The etcd nodes seemed to lose connectivity with each other, and after a minute or so they re-established it. They all seem happy now and appear to be communicating OK.

calico-node is also up on all of the worker nodes and seems to be running fine. The only issue is calico-kube-controllers; I still haven’t figured out what is stopping it from communicating with etcd. Still digging.

@fasaxc - Sorry for the delay in getting back to you. I appreciate your previous assistance. It was indeed a communication issue between Calico and etcd, and your words led me down the right road. An etcd-ca signed client certificate for Calico had expired, which was causing the issue. Certificates were renewed, old nodes were culled, and everything is happy once again.
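
Since the root cause here was an expired client certificate, checking expiry dates directly is a useful early step in similar cases. The following is a minimal Python sketch under stated assumptions: the expiry line is obtained separately with `openssl x509 -enddate -noout -in <cert>`, which prints a line of the form `notAfter=<date> GMT`. The function name `days_until_expiry` is hypothetical, and the date in the docstring is purely illustrative, not the date from this cluster.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(enddate_line: str, now: Optional[float] = None) -> float:
    """Days until the certificate expires; negative if already expired.

    enddate_line is the output of `openssl x509 -enddate -noout`,
    for example 'notAfter=Jan  5 09:34:43 2018 GMT' (illustrative date).
    """
    # Drop the 'notAfter=' prefix if present, keeping only the timestamp.
    cert_time = enddate_line.strip().split("=", 1)[-1]
    # Parse the certificate time format into seconds since the epoch (UTC).
    expires_at = ssl.cert_time_to_seconds(cert_time)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400.0
```

A negative result means the certificate has already expired. Running this against each client certificate that Calico presents to etcd would flag the kind of failure described in this thread.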