How to Rebuild a Master Node in OpenShift 4.x

This article is for anyone working with the Red Hat OpenShift platform. The steps below will help you rebuild or replace a master node with a new node in the cluster.

Before we start the node rebuild activity, let's take an etcd backup.

OCP 4.x ships with a ready-made backup script that saves the etcd state. Even though master-0 is already unavailable, it is good to have a backup in case any additional problems arise (e.g. human error) and the cluster ends up in a worse state.

Steps to take an etcd backup

Log in to one of the running master nodes (pick a healthy node if master-0 is the one that is down):

$ oc debug node/<running-master-node>

sh-4.4# chroot /host

Run the cluster-backup.sh script and pass in the location to save the backup to.

sh-4.4# /usr/local/bin/cluster-backup.sh /home/core/assets/backup

Now that we have taken the etcd backup, we can start verifying the other parts of the cluster.

Pre-checks:

Before rebuilding the master node, check that everything is OK from the nodes/machines point of view, and check the cluster operators' state. Nodes/machines should all be in Ready/Running state, and all cluster operators should be Available, not Progressing, and not Degraded.

$oc get co
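Eyeballing `oc get co` across dozens of operators is error-prone, so a small awk filter can flag anything that is not Available, or that is Progressing or Degraded. The sample output below is made up so the snippet is self-contained; on a live cluster you would pipe `oc get co --no-headers` into the same awk instead.

```shell
# Columns mirror `oc get co` output: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
sample='authentication   4.10.3   True   False   False   5d
etcd             4.10.3   True   True    False   5d
kube-apiserver   4.10.3   True   False   False   5d'

# Print the name of any operator that is not Available=True / Progressing=False / Degraded=False
echo "$sample" | awk '$3 != "True" || $4 != "False" || $5 != "False" {print $1}'
# prints: etcd
```

An empty result means all operators in the sample are healthy and you can proceed.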

Check the quorum and determine which master node currently holds the etcd leader role; the commands below will help you identify the leader:

$oc get pods -n openshift-etcd -o wide | grep -v etcd-quorum-guard | grep etcd

$ oc rsh -n openshift-etcd etcd-master-0 etcdctl endpoint status --cluster -w table
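If you prefer not to scan the table by eye, you can extract the leader's endpoint from the `etcdctl endpoint status -w table` output. The sample rows below are invented for illustration; in that table format the IS LEADER flag is the fifth data column.

```shell
# Two made-up rows in the pipe-delimited table format etcdctl prints
sample='| https://10.0.0.1:2379 | 8e9e05c52164694d | 3.5.0 | 104 MB | false |
| https://10.0.0.2:2379 | 91bc3c398fb3c146 | 3.5.0 | 104 MB | true  |'

# Print the endpoint of the row whose IS LEADER column is "true"
echo "$sample" | awk -F'|' '$6 ~ /true/ {gsub(/ /, "", $2); print $2}'
# prints: https://10.0.0.2:2379
```

Against a real cluster you would pipe the `oc rsh ... etcdctl endpoint status --cluster -w table` output into the awk instead of the sample.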

Actual Steps:

Now identify the master node that you want to rebuild/recreate:

$oc get nodes | grep master

Export the master machine definition in YAML format.

Extract the definition of the Machine to be replaced; it will be used as the basis for the new master Machine.

$ oc get machines -n openshift-machine-api <master-machine-name> -o yaml > new-master-machine.yaml

Modify the exported Machine manifest file and clean up/update everything specific to the previous Machine:

status section: remove completely
annotations: remove as needed
name: change to a new machine name

Example yaml:
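No example manifest survived here, so below is a hedged sketch of what the cleaned-up file might look like. All angle-bracket values are placeholders, and the providerSpec body is platform-specific; keep whatever your export contains for it.

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  name: <new-master-name>          # changed from the old machine's name
  namespace: openshift-machine-api
  labels:
    machine.openshift.io/cluster-api-cluster: <cluster-id>
    machine.openshift.io/cluster-api-machine-role: master
    machine.openshift.io/cluster-api-machine-type: master
spec:
  providerSpec:
    value:
      # platform-specific fields (instance type, image, network, ...)
      # keep these exactly as exported from the old machine
# note: the entire status: section from the export has been removed,
# and unneeded annotations have been dropped
```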

Once your machine manifest file is ready, delete the machine you want to replace.

$ oc delete machine -n openshift-machine-api <master-machine-name>

Once you trigger the deletion, don't panic: it takes around 25-30 minutes.

Note: the OpenShift UI and CLI will not be accessible while this deletion is in progress.

Once the node is deleted from the cluster, follow the steps below to clean up etcd.

Note: the cleanup can only be done after the machine has been deleted; otherwise the etcd member is recreated automatically by the openshift-etcd operator.

You can use one of the remaining healthy etcd pods to remove the deleted node's etcd member from the member list:

$oc rsh -n openshift-etcd <etcd-pod-name>

# once you are logged in to the pod, list the existing members

sh-4.4# etcdctl member list -w table

Fig: 01

# remove the deleted machine's etcd member, using the ID of master-1

sh-4.4# etcdctl member remove e7ewree9a898f9wre4

Member e7ewree9a898f9wre4 removed from cluster e4afsafafaf78387578

# check that the member has been properly removed

sh-4.4# etcdctl member list -w table

You will also notice that only 2 etcd members exist now. This is because the etcd pod associated with master-1 no longer exists.

Fig: 02

It's time to create the new machine using the manifest we exported in the previous steps:

$ oc apply -f new-master-machine.yaml

Monitor the creation of the new master:

# wait for the new machine to reach the "Running" status (~15 minutes for the server to be provisioned on the IaaS side before it switches to Running)

$ watch oc get machine -n openshift-machine-api

# in case of issues with machine provisioning, you can check the machine controller logs with

$ oc -n openshift-machine-api logs deploy/machine-api-controllers -c machine-controller

# wait for the new etcd pod to be "Ready"

# note that all existing etcd pods will be restarted; make sure they are Ready too before taking any further action

$ watch “oc get pods -n openshift-etcd -o wide | grep -v etcd-quorum-guard | grep etcd”

# likewise, make sure the kube-apiserver pods have all been restarted successfully

$ oc -n openshift-kube-apiserver get pod

# if you have a Calico setup, make sure all Calico pods are running; otherwise, ignore this step

$ oc -n calico-system get pod -o wide

# wait until all cluster operators report Available=True and Progressing=False

$ watch oc get co

Note: also check in the OpenShift console overview that the control plane status is all green.

Fig: 03

Check that the new etcd member has been added properly:

$oc rsh -n openshift-etcd <etcd-pod-name>

sh-4.4# etcdctl member list -w table

Fig: 04

Check the master nodes:

$ oc get nodes | grep master

Fig: 05

Once the new master is fully up and Ready/Running, some secrets linked to the former master node need to be deleted manually to avoid leaking resources.

# list the secrets of the old master

$ oc get secrets -n openshift-etcd | grep master-1

etcd-peer-master-1 kubernetes.io/tls 2 2d16h
etcd-serving-master-1 kubernetes.io/tls 2 2d16h
etcd-serving-metrics-master-1 kubernetes.io/tls 2 2d16h

# delete the secrets manually, one by one

$ oc delete secret -n openshift-etcd etcd-peer-master-1

$ oc delete secret -n openshift-etcd etcd-serving-master-1

$ oc delete secret -n openshift-etcd etcd-serving-metrics-master-1
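The three deletions can also be scripted. The loop below is a dry-run sketch: `NODE` is an assumed placeholder for the old master's name, and it only echoes the commands; pipe the output to `sh` (or drop the `echo`) to actually run the deletions against a cluster.

```shell
# Dry run: print the delete command for each secret tied to the old master
NODE=master-1
for prefix in etcd-peer etcd-serving etcd-serving-metrics; do
  echo "oc delete secret -n openshift-etcd ${prefix}-${NODE}"
done
```

This keeps the secret-name pattern (`etcd-peer-<node>`, `etcd-serving-<node>`, `etcd-serving-metrics-<node>`) in one place instead of three hand-typed commands.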

You should be good now…

Note: the steps above are based on my experience and were followed in my own setup; please verify them in a test environment before using them.

Happy Learning..

If you found this content useful, please share it with others so they can benefit from it too.

Kamlesh Prajapati

DevOps Practitioner (CKA certified, RHOCP certified, Azure certified: AZ-104, AZ-400, AZ-303)