Kamlesh Prajapati
Jan 12, 2023


OpenShift API Connection Lost…

There are many day-to-day issues on OpenShift, but the one I am describing in this blog is quite interesting. I saw it recently in one environment, where OpenShift login was not working and the connection to the OpenShift API was lost.

As an OpenShift administrator, what is the first thing you would do to troubleshoot and fix the API issue?

Normally you would log in to the cluster and check the control-plane pods (kube-apiserver, the authentication server pods, etc.), but API login itself is not working, so you need to think about how to check those containers.

The challenge, then, is to log in and check the containers. There are two ways into the cluster: use the 'kubeconfig' file, or SSH to a master node directly from the bastion server using the SSH key.

Steps to login:

#ssh -i <private-key> user-name@<bastion-server-ip>

#ssh -i <private-key> core@<master-node-ip>

Note: OCP 4.x uses a common user name, core, for all nodes.
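
If you prefer a single hop, the two commands can be combined with SSH's ProxyJump option. This is just a convenience sketch; the key path and addresses are placeholders, not values from this environment.

#ssh -i <private-key> -J user-name@<bastion-server-ip> core@<master-node-ip>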

Once you are logged in to the master node, check the pods running on it using CRI-O commands:

[core@wed-qpkwz-master-1 ~]$ sudo crictl pods
POD ID           CREATED        STATE      NAME                                     NAMESPACE                  ATTEMPT   RUNTIME
6f65d2f5c3qew8   32 hours ago   NotReady   oauth-openshift-f5894bbc08-57sz8         openshift-authentication   0         (default)
d4a44a5d65wew8   2 days ago     NotReady   revision-pruner-200-wed-qpkwz-master-1   openshift-kube-apiserver   0         (default)
2e48032b47954e   2 days ago     NotReady   kube-apiserver-wed-qpkwz-master-1        openshift-kube-apiserver   0         (default)

As you can see above, the pods are in NotReady state. Before restarting them, you need to check the etcd quorum status, because if etcd is down or no leader has been elected, restarting the pods will not help. In that case you first need to recover the etcd server; once etcd is recovered and a LEADER is chosen, you can restart the control-plane pods.

Now let's see how to check and recover the etcd server.

  1. Log in to the etcd container using the command below:
    $ sudo crictl exec -it <container-id> sh
  2. Once you are in the etcd pod, run the command below to check the status of the etcd members:
    $ etcdctl member list -w table
  3. Now check the etcd endpoints using the commands below (a combined sketch follows this list):
    $ etcdctl endpoint status -w table
    $ etcdctl endpoint health --cluster
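
Putting steps 1, 2, and 3 together, here is a minimal sketch run from one master. The --name filter is an assumption about how the etcd container is labelled; use whatever container ID crictl ps actually shows in your environment.

    $ sudo crictl ps --name etcd
    $ sudo crictl exec -it <etcd-container-id> sh
    $ etcdctl member list -w table
    $ etcdctl endpoint status -w table
    $ etcdctl endpoint health --cluster

A healthy cluster shows every member as started, every endpoint as healthy, and exactly one endpoint with IS LEADER set to true.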

If no LEADER is elected in the table, restart the etcd server pod on each master, one by one.

Repeat steps 1, 2, and 3 to check whether the etcd members are back to normal and one of them has been elected as LEADER.

Recover the control-plane pods by deleting the pod sandbox (the kubelet will recreate the static pods):

#sudo crictl stopp <pod-id>

#sudo crictl rmp <pod-id>

Now that your etcd server pod is up and running, you can go ahead and restart the kube-apiserver and oauth-openshift pods in the same way.
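
A minimal sketch of that restart flow, assuming the kubelet recreates the static pods once their sandboxes are removed; the pod IDs are whatever crictl pods returns on your master.

    $ sudo crictl pods --namespace openshift-kube-apiserver
    $ sudo crictl pods --namespace openshift-authentication
    $ sudo crictl stopp <pod-id>
    $ sudo crictl rmp <pod-id>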

Once all the pods are back in Running state, wait for some time until you see all the containers up and running.

Now check API login; it should be working fine.
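
To confirm from the bastion, here is a quick verification sketch; the kubeconfig path is a placeholder for wherever your installer left it.

    $ export KUBECONFIG=<path-to-kubeconfig>
    $ oc whoami
    $ oc get clusteroperators kube-apiserver authentication
    $ oc get nodes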

Now let's talk about some common causes of etcd being down or unhealthy.

An under-resourced file system: the etcd consensus protocol requires etcd cluster members to write every request down to disk, and every time a key is updated a new revision is created. When the file system runs low on space (usage above 75%), etcd goes into read/delete-only mode until old revisions and keys are removed or disk space is added. For optimal performance, it is recommended to keep disk usage below 75%.
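
A quick way to check both numbers on a master is sketched below; /var/lib/etcd is the usual etcd data path on OCP 4.x, and the DB SIZE column of etcdctl endpoint status shows how large the keyspace has grown.

    $ df -h /var/lib/etcd
    $ sudo crictl exec -it <etcd-container-id> etcdctl endpoint status -w table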

I/O contention: the etcd consensus protocol requires etcd cluster members to write every request down to disk, making etcd very sensitive to disk write latency. Systems under heavy load, particularly during peak or maintenance hours, are susceptible to I/O bottlenecks as processes compete for resources. This contention increases I/O wait time and can prevent etcd members from answering heartbeats before they time out.
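
One common way to gauge whether the disk can keep up with etcd's fsync-heavy writes is a small fio test like the one below. Run it against a scratch directory on the same device as /var/lib/etcd, never against the live data directory; the size and block size are the commonly used test values, not anything specific to this incident.

    $ sudo fio --rw=write --ioengine=sync --fdatasync=1 \
        --directory=/var/lib/etcd-test --size=22m --bs=2300 --name=etcd-perf

What matters is the fdatasync 99th percentile in the output; a commonly quoted rule of thumb is that it should stay under roughly 10 ms for etcd to be comfortable.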

Network delay: etcd, which is critical for the stability of the cluster, may experience issues that show up as network transit timeouts. This could be due to either actual network timeouts or heavy resource starvation at the etcd level. If you notice that timeout errors typically coincide with periods of heavy network traffic, then network delay could be the root cause.
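
A rough sketch for ruling network latency in or out: measure round-trip time between the masters and look for slow-request warnings in the etcd logs. The "took too long" phrase is a typical etcd warning, and the container ID is whatever crictl ps --name etcd returns.

    $ ping -c 5 <other-master-node-ip>
    $ sudo crictl logs <etcd-container-id> 2>&1 | grep -i 'took too long'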

Thanks for reading…

This is based on one issue and my experience with it; that said, there is no guarantee it will work in all cases, but it should give you some idea if you hit a similar problem.

