Node/Kubelet Troubleshooting for OpenShift…

Kamlesh Prajapati
FAUN — Developer Community 🐾
8 min read · Dec 14, 2022


Some of the most common problems you may run into with Kubernetes/OpenShift nodes involve the kubelet service. In this article we will look at how to troubleshoot node issues.

We will walk through some common node/machine troubleshooting steps on OCP 3.11 and OCP 4.x.

As you all know, nodes are very important; they are the fundamental building blocks of the cluster. When it comes to doing work, if you don't have functional nodes then you don't have a cluster that can do anything. Each node has an agent on it called the kubelet service, which listens on port 10250. The kubelet is responsible for taking instructions from the control-plane, such as which pods to start or stop, and it also communicates information back to the control-plane, such as pod status or whether the node is running into problems. Without a kubelet you really don't have a node; it is a core component of how Kubernetes and OpenShift work.

Node review

A kubelet service instance runs on every node (listens on port 10250)

Takes instructions from the control-plane

Communicates with cluster API

Manages pods running on the nodes
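As a quick sanity check (a minimal sketch, assuming you can open a debug shell on the node), you can confirm the kubelet is active and listening on port 10250:

$ oc debug node/<NODE_NAME>
# chroot /host
# systemctl is-active kubelet
# ss -tlnp | grep 10250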

Node architecture

Node Components:

  1. Properly running nodes require a lot of things to be in good working order.

The diagram below shows in a little more detail what makes up the node components. As you can see, the kubelet relates to all the other pieces, and any one of them could be a source of problems.

Briefly, let's first talk about the operating system itself, which is going to be RHEL on OCP 3.x nodes or RHCOS on OpenShift 4 nodes. If you have any missing configuration or something is corrupted, that can definitely cause issues with node availability in the cluster. Sometimes there are bugs responsible for bringing a node offline, and those may require a patch or a workaround, so it is good to be aware of currently known bugs in the operating system components, as they can affect the node. The other major piece in the diagram that is extremely important is the container runtime. This will be CRI-O (or Docker on OCP 3.x), and it is the piece responsible for running the pod workloads on the node and the containers associated with those pods. The kubelet delegates all of that work to the runtime, so it is a common cause of node problems and fundamental to having a healthy node.

One of the resources that often causes node problems is storage. Storage is used for all pod data and logs, as well as container and image information, and the kubelet is always trying to make sure enough storage is available for the node to work correctly. If you run out of storage, or have other storage problems, that can bring the node offline. Memory is another extremely important resource: if you run out of it, or are even just running low on available memory, that can lead to poor performance or actually bring workloads down as well. Finally there is networking, which is absolutely critical for a node to function in the cluster, because the kubelet has to talk to the control-plane and vice versa. If the node and the control-plane can't talk to each other, then within a short amount of time the node will be marked NotReady.

By the way, any one of the components mentioned in the diagram can cause a problem.

Gathering Data:

When gathering data on a node issue, this is what I typically look at as early as possible.

General node Info:

$ oc get node -o wide
$ oc get node <NODE_NAME> -o yaml
$ oc describe node <NODE_NAME>

Kubelet status and logs from a node’s terminal:

--- On an OpenShift 4.x node (use ‘oc debug node …’ when possible) ---

# systemctl status kubelet
# journalctl -u kubelet --no-pager

--- On an OpenShift 3.x node ---
# systemctl status atomic-openshift-node
# journalctl -u atomic-openshift-node --no-pager

$ oc adm node-logs <NODE_NAME> -u kubelet

When in doubt (or not), get a sosreport:
--- For an OpenShift 4.x node ---
-> If the kubelet is responding:
$ oc debug node/<NODE_NAME>
# chroot /host
# toolbox
# sosreport -k crio.logs=true -k crio.all=true

-> If the kubelet is not responding:
$ ssh <USER@NODE_NAME>
$ sudo su
# toolbox
# sosreport -k crio.logs=true -k crio.all=true

--- For an OpenShift 3.x node with CRI-O ---
# sosreport -k crio.logs=true -k crio.all=true

--- For an OpenShift 3.x node with Docker ---
# sosreport -k docker.logs=true -k docker.all=true

Red Hat solution links:
https://access.redhat.com/solutions/3592
https://access.redhat.com/solutions/4387261
https://access.redhat.com/solutions/3820762

Look for hyperkube and kubelet in OCP 4.x journalctl output:

  1. Unfortunately, must-gather doesn't help much in diagnosing node runtime issues, and it can't replace a sosreport
  2. However, the node and machine-config definitions and operator logs may be useful, so it doesn't hurt to get one anyway (see the command below)
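Collecting a must-gather is quick, so it is usually worth grabbing one alongside the sosreport:

$ oc adm must-gather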

Other Helpful Data

> Important configuration/resource locations in OCP 4.x nodes:

-> /etc/kubernetes
-> /var/lib/kubelet

> In OCP 3.x nodes:

-> /etc/origin/node
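If you just need a quick look at these locations without a full debug session, something like this works on a 4.x node (a sketch; adjust the paths for 3.x):

$ oc debug node/<NODE_NAME> -- chroot /host ls -l /etc/kubernetes /var/lib/kubelet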

One of the most common indications of a node problem is the ‘NotReady’ status:

  1. Happens when a node self-reports a problem to the control-plane (such as low storage)
  2. Happens if a node stops talking to the control-plane for a period of time
  3. Eventually causes pod workloads to get moved to healthy nodes
  4. Node will no longer take on pods until issue is resolved
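A quick way to spot which nodes are affected:

$ oc get nodes | grep -w NotReady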

Common causes of ‘NotReady’

Disk or memory pressure (i.e. low node resources)

  1. A node may get marked NotReady if the kubelet is unable to reclaim resources in a reasonable amount of time
  2. Node status may “oscillate” or “flap” between Ready/NotReady
  3. Often accompanied by evicted pods and disrupted applications

4. Look for eviction- or pressure-related entries in the kubelet journal (see the example check after this list)

5. The node’s status conditions may also indicate there’s an issue.
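As a rough sketch of both checks from items 4 and 5 (the exact kubelet messages vary by version, so treat the grep pattern only as a starting point): the first command runs on the node itself (via oc debug or SSH), the second from anywhere with cluster access.

# journalctl -u kubelet --no-pager | grep -iE 'evict|pressure|garbage'
$ oc get node <NODE_NAME> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'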

Investigating Disk Pressure:

  1. Use the df and (sometimes) du filesystem tools to check disk usage
  2. Pay attention to directories used by the node components:

-> For CRI-O (4.x and 3.x)
    -> Container/image files: /var/lib/containers/ …

-> For Docker (OCP 3.x)
    -> Container/image files: /var/lib/docker/ …
    -> Pod logs: /var/lib/docker/containers/…/*.log

-> Ephemeral storage
    -> 4.x: /var/lib/kubelet/pods
    -> 3.x: /var/lib/origin/openshift.local.volumes/ …

3. Check container storage in OCP 4.x and 3.x with CRI-O

4. Check container storage in OCP 3.x with Docker

5. Check ephemeral storage in 4.x and 3.x (example commands for items 3-5 are shown after this list)

6. Check Garbage Collection and Eviction settings

7. Look into applications consuming large amounts of ephemeral or log storage

. Implement pod log rotation in OCP 3.x if not already active (OCP 4.x should already be rotating pod logs)

. Docker log rotation: https://access.redhat.com/solutions/2334181

. To adjust 4.x log rotation: https://access.redhat.com/solutions/4924281

8. Use Quotas to enforce ephemeral storage limits for pods.

9. Manually clearing unnecessary container resources:

CRIO: https://access.redhat.com/solutions/5610941

Docker: https://access.redhat.com/solutions/3508421

10. How big is the issue?

-> Node storage may need to be expanded

-> More nodes may need to be added
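Here is a minimal sketch of the checks that items 3-5 refer to, assuming the standard storage paths listed earlier and that crictl/docker are available on the node:

--- Container storage with CRI-O (4.x and 3.x) ---
# df -h /var/lib/containers
# crictl images
# crictl ps -a

--- Container storage with Docker (OCP 3.x) ---
# df -h /var/lib/docker
# docker system df

--- Ephemeral storage used by pods ---
# du -sh /var/lib/kubelet/pods/* (4.x)
# du -sh /var/lib/origin/openshift.local.volumes/* (3.x)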

Another way to investigate memory pressure is to review metrics related to memory use:

> $ oc adm top node (if the issue is actively/repeatedly occurring)
> $ oc adm top pod (if the issue is actively/repeatedly occurring)
> Grafana/Prometheus metrics if possible

Try to isolate a problematic application (OOM messages in pod and journalctl logs may help)

Review the quotas, requests, and limits being used by applications

. High (or no) limit values => potentially high memory usage

. Low request values => high-density scheduling (overcrowded nodes)
https://access.redhat.com/solutions/4367311
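As a hedged example (the deployment name is hypothetical), requests and limits can be adjusted directly with oc, and the node's allocations reviewed afterwards:

$ oc set resources deployment/<APP_NAME> --requests=memory=256Mi --limits=memory=512Mi
$ oc describe node <NODE_NAME> | grep -A 10 'Allocated resources'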

Expired node certificates:

  1. Expired kubelet certificates are often the cause of nodes going offline (they are required for communication with the control-plane)
  2. Very common in OCP 3.11 clusters, as users forget to monitor for new certificate signing requests (CSRs), which are renewed yearly by default, or the certificates get out of sync with the master certs
  3. Improvements have been made in OCP 4.x to automate this, but similar issues occur surprisingly often

4. If there are pending CSRs, approving them is often all that's needed (see the example after this list).

5. Otherwise, manual intervention is required to force CSR creation and certificate renewal…

6. Manual OCP 3.11 node certificate renewal is well documented but sometimes challenging depending on whether the control-plane is down.

7. Simplest case requires a valid bootstrap.kubeconfig file and then…

https://access.redhat.com/solutions/3782361
https://access.redhat.com/solutions/3670231
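The pending-CSR case from item 4 typically looks like this (a sketch; only approve CSRs you recognize):

$ oc get csr
$ oc adm certificate approve <CSR_NAME>
$ oc get csr -o name | xargs oc adm certificate approve (approves all pending CSRs, use with care)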

Certificate renewal issues are rare in OCP 4.x, but if renewal is required, the steps are here:

https://access.redhat.com/solutions/4923031

There is a chance that the control-plane is unable to reach the kubelet, or vice versa (networking issue)

  1. Problems will ensue if the control-plane can't reach a node's kubelet or the kubelet can't reach the control-plane. Log in to one of the control-plane nodes and perform the steps below.

2. Ensure the firewall and security group configuration is correct and allows traffic on port 10250 (see the connectivity check after this list)

3. Ensure SDN components are running on the node and working as expected

4. A sosreport can be very helpful for further troubleshooting of the node
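A quick connectivity check for item 2, run from a control-plane node (a sketch; substitute the node's IP or hostname):

$ curl -kv https://<NODE_IP>:10250/healthz
$ nc -zv <NODE_IP> 10250

An authentication error from the kubelet still proves the port is reachable; a timeout or connection refused points at a firewall/security-group or kubelet problem.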

If there are no storage, memory, or network issues, then the following might be the problem:

Container runtime issues

  1. Check the runtime status and logs; see if any containers are running or can be started (example checks follow the links below)

CRIO: https://access.redhat.com/solutions/5470441
Docker: https://access.redhat.com/solutions/3258011
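For item 1, the basic runtime checks look something like this (a sketch; run on the affected node via oc debug or SSH):

--- CRI-O (4.x and 3.x with CRI-O) ---
# systemctl status crio
# journalctl -u crio --no-pager | tail -100
# crictl ps -a
# crictl info

--- Docker (OCP 3.x) ---
# systemctl status docker
# journalctl -u docker --no-pager | tail -100
# docker ps -a
# docker info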

Appendix

Init containers continually restarting

  1. In OCP 3.11 (and possibly 4.x), some deployments with more than one init container may randomly restart, resulting in errors
  2. This is likely due to the maximum-dead-containers and/or maximum-dead-containers-per-container kubelet settings being too low

Red Hat solution: https://access.redhat.com/solutions/3517821
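On OCP 3.11 these kubelet flags are typically set through kubeletArguments in the node configuration; a quick way to check the current values on a node (a sketch, assuming the default config location):

# grep -A 5 'kubeletArguments' /etc/origin/node/node-config.yaml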

Happy learning…

If you find the content helpful, please do share it with others; let's spread the knowledge. :)

References:

https://docs.openshift.com/container-platform/3.11/admin_guide/garbage_collection.html

