Kubernetes Taints and Tolerations
In this article, I will describe taints and tolerations, along with some details about placing pods on specific worker nodes while keeping them off the nodes where you don’t want them scheduled.
Many organizations and teams need multi-tenant, heterogeneous Kubernetes clusters to meet their users’ application needs. They may also need to address special constraints on the cluster: for example, some pods may require special hardware, colocation with other specific pods, or isolation from other application pods. There are many options for placing those application containers onto separate nodes and node groups, one of which is the use of taints and tolerations.
Taints and Tolerations — Concepts
Taints and tolerations are a mechanism that allows you (or your application team) to ensure that pods are not scheduled onto inappropriate nodes. Taints are added to nodes (a single node can have more than one taint), while tolerations are defined in the pod specification (a single pod can define more than one toleration).
You can put multiple taints on the same node and multiple tolerations on the same pod. The way Kubernetes processes multiple taints and tolerations is like a filter: start with all of a node’s taints, then ignore the ones for which the pod has a matching toleration; the remaining un-ignored taints have the indicated effects on the pod.
For example, most Kubernetes distributions automatically taint the master nodes so that only the pods that manage the control plane are scheduled onto them, and not any data/app pods deployed by users. This ensures that the master nodes are dedicated to running control plane pods.
A taint can produce three possible effects:
NoSchedule: The Kubernetes scheduler will only schedule a pod onto the tainted node if the pod has a matching toleration. Pods already running on the node are not affected.
PreferNoSchedule: The Kubernetes scheduler will try to avoid scheduling pods that don’t have a matching toleration onto the tainted node, but this is not guaranteed.
NoExecute: Kubernetes will evict pods already running on the node if they don’t have a matching toleration, and will not schedule new pods without one.
For example, imagine you taint a node like this:
kubectl taint nodes node1 key1=value1:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key2=value2:NoSchedule
And a pod has two tolerations:
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
In this case, the pod will not be able to schedule onto the node, because there is no toleration matching the third taint. But the pod will keep running if it is already placed on the node when the taint is added, because the third taint is the only one of the three that the pod does not tolerate, and its effect is NoSchedule, which does not evict running pods.
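To make the filtering concrete, here is a minimal sketch of a complete pod manifest that tolerates all three taints and could therefore be scheduled onto node1. The pod name, container name, and image are placeholders chosen for illustration; only the keys, values, and effects come from the taints above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerates-all-three    # hypothetical pod name
spec:
  containers:
  - name: app                  # hypothetical container name
    image: nginx               # placeholder image
  tolerations:
  # Matches the first taint: key1=value1:NoSchedule
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
  # Matches the second taint: key1=value1:NoExecute
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
  # Matches the third taint: key2=value2:NoSchedule
  - key: "key2"
    operator: "Equal"
    value: "value2"
    effect: "NoSchedule"
```

With the third toleration in place, no un-ignored taint remains after filtering, so nothing prevents the scheduler from placing the pod on node1.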
Normally, if a taint with effect NoExecute is added to a node, then any pods that do not tolerate the taint will be evicted immediately, and pods that do tolerate the taint will never be evicted. However, a toleration with NoExecute effect can specify an optional tolerationSeconds field that dictates how long the pod will stay bound to the node after the taint is added. For example,
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600
means that if this pod is running and a matching taint is added to the node, then the pod will stay bound to the node for 3600 seconds, and then be evicted. If the taint is removed before that time, the pod will not be evicted and will keep running on the node.
Use Cases for Taints and Tolerations
Dedicated Nodes
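If you need to dedicate a group of worker nodes to a set of users, you can add a taint to those nodes, for example with a command like this (nodename is a placeholder for the node’s name):
kubectl taint nodes nodename dedicated=groupName:NoSchedule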
Then add a toleration for the taint to that user group’s pods so they can run on the tainted nodes. A toleration only permits scheduling onto the tainted nodes; it does not prevent the pods from landing elsewhere. To further ensure that the pods only get scheduled on that set of tainted nodes, you can also add a label to those nodes, e.g., dedicated=groupName. Then use nodeSelector in the deployment/pod spec, which will make sure that pods from the user group are bound to the node group and don’t run anywhere else. Alternatively, an admission controller can add the tolerations automatically, in which case it should additionally add a node affinity requiring that the pods can only schedule onto nodes labeled with dedicated=groupName.
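As a rough sketch of the pod side of this setup, the manifest below combines a toleration with a nodeSelector. It assumes the nodes carry a taint dedicated=groupName with effect NoSchedule and a label dedicated=groupName; the pod name, container name, and image are placeholders, not part of the original setup.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: group-app              # hypothetical pod name
spec:
  containers:
  - name: app                  # hypothetical container name
    image: nginx               # placeholder image
  # Allows the pod onto nodes tainted dedicated=groupName:NoSchedule (assumed effect)
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "groupName"
    effect: "NoSchedule"
  # Restricts the pod to nodes labeled dedicated=groupName
  nodeSelector:
    dedicated: groupName
```

The toleration lets the pod in, and the nodeSelector keeps it from running anywhere else; either one alone gives only half of the guarantee.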
Nodes with Special Hardware
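If there are worker nodes with special hardware, you need to make sure that normal application pods that don’t need the special hardware don’t run on those worker nodes. Do this by adding a taint to those nodes, for example as follows (nodename is a placeholder, and the key special=true is only an illustrative choice):
kubectl taint nodes nodename special=true:NoSchedule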
Later on, the pods that require the special hardware can be run on those worker nodes by adding tolerations for the above taint.
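A pod that needs the special hardware might then look like the sketch below. It assumes the nodes were tainted with a hypothetical special=true key and value and the NoSchedule effect; substitute whatever taint you actually applied. The pod name, container name, and image are placeholders as well.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardware-job           # hypothetical pod name
spec:
  containers:
  - name: worker               # hypothetical container name
    image: nginx               # placeholder image
  tolerations:
  # Must match the key, value, and effect of the taint on the hardware nodes
  # (special=true:NoSchedule is an assumed example, not a built-in taint)
  - key: "special"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```

Pods without this toleration are kept off the special-hardware nodes, which leaves that capacity free for the workloads that need it.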
Taint-Based Evictions
A taint with NoExecute effect will evict the existing pods on the node if the pods have no toleration for that taint:
- pods that do not tolerate the taint are evicted immediately
- pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever
- pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time
The k8s node controller will automatically add this kind of taint to a node in some scenarios, so that pods are evicted and the node is “drained” (has all of its pods evicted). For example, a network outage may cause a node to become unreachable from the controller. In this scenario, it is best to move all of the pods off the node so that they can be rescheduled onto other nodes. The node controller takes this action automatically to avoid the need for manual intervention.
The following are built-in taints:
- node.kubernetes.io/not-ready: Node is not ready. This corresponds to the NodeCondition Ready being "False".
- node.kubernetes.io/unreachable: Node is unreachable from the node controller. This corresponds to the NodeCondition Ready being "Unknown".
- node.kubernetes.io/memory-pressure: Node has memory pressure.
- node.kubernetes.io/disk-pressure: Node has disk pressure. High disk utilization on a node can slow applications down, so it is better to relocate pods.
- node.kubernetes.io/pid-pressure: Node has PID pressure.
- node.kubernetes.io/network-unavailable: Node's network is unavailable.
- node.kubernetes.io/unschedulable: Node is unschedulable. This covers any other reason that makes the node inappropriate for hosting pods, for example if the cluster is being scaled down and the node is being removed.
- node.cloudprovider.kubernetes.io/uninitialized: Node has not yet been initialized by the cloud provider. When the kubelet is started with an external cloud provider, this taint marks the node as unusable until a controller from the cloud-controller-manager initializes it.
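A pod can tolerate these built-in taints like any other taint. The sketch below keeps a pod bound to a node for a limited time after the node becomes not-ready or unreachable, instead of evicting it immediately. The pod name, container name, image, and the 120-second window are illustrative assumptions. Because these taints are matched on the key alone here, the toleration uses the Exists operator rather than Equal.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: patient-app            # hypothetical pod name
spec:
  containers:
  - name: app                  # hypothetical container name
    image: nginx               # placeholder image
  tolerations:
  # Stay bound for up to 120 seconds (assumed value) after the node is marked not-ready
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 120
  # Same window if the node becomes unreachable from the node controller
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 120
```

A short window like this can ride out a brief network blip without rescheduling, while still moving the pod if the node stays unhealthy.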
How to Use Taints and Tolerations
Let me now present a scenario to help you understand taints and tolerations. Let’s start with a k8s cluster whose worker nodes are categorized into multiple groups, such as front-end nodes, back-end nodes, and infra nodes. Let’s assume that we need to deploy the front-end application pods so that they are placed only on front-end worker nodes and not on the back-end worker nodes. We will also ensure that new pods are not scheduled onto master nodes, because master nodes run control plane components such as etcd.
Below is the list of nodes, showing the taints that k8s applied by default during the installation:
kubectl get nodes -o=custom-columns=NodeName:.metadata.name,TaintKey:.spec.taints[*].key,TaintValue:.spec.taints[*].value,TaintEffect:.spec.taints[*].effect
NodeName             TaintKey                                                            TaintValue   TaintEffect
cluster01-master-1   node-role.kubernetes.io/controlplane,node-role.kubernetes.io/etcd   true,true    NoSchedule,NoExecute
cluster01-master-2   node-role.kubernetes.io/controlplane,node-role.kubernetes.io/etcd   true,true    NoSchedule,NoExecute
cluster01-master-3   node-role.kubernetes.io/controlplane,node-role.kubernetes.io/etcd   true,true    NoSchedule,NoExecute
cluster01-worker-1   <none>                                                              <none>       <none>
From the output above, we can see that the master nodes were already tainted by the k8s installation, so no application pods land on them unless the user intentionally configures pods with tolerations for those taints. The output also shows that the worker node has no taints associated with it. Now let me taint the worker node, so that only pods that have a toleration for the taint will get scheduled on it.
kubectl taint nodes cluster01-worker-1 app=frontend:NoSchedule
node/cluster01-worker-1 tainted
The above taint has the key app, the value frontend, and the effect NoSchedule, which means that no pod will be placed on this node unless the pod defines a toleration for the taint.
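To complete the scenario, a front-end pod needs a toleration that matches this taint. The manifest below is a minimal sketch: the pod name, container name, and image are placeholders, while the toleration mirrors the app=frontend:NoSchedule taint applied to cluster01-worker-1.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend-app           # hypothetical pod name
spec:
  containers:
  - name: web                  # hypothetical container name
    image: nginx               # placeholder image
  tolerations:
  # Matches the taint app=frontend:NoSchedule on cluster01-worker-1
  - key: "app"
    operator: "Equal"
    value: "frontend"
    effect: "NoSchedule"
```

Keep in mind that the toleration only allows the pod onto the tainted node; it can still be scheduled on any untainted node. To pin front-end pods to front-end nodes exclusively, you would also label those nodes and add a matching nodeSelector, as described in the Dedicated Nodes use case.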
I hope this blog helps you understand taints and tolerations.
Happy Reading….