Debug School

Suyash Sambhare
Node Reboot Options

Best practices to follow for node reboots in OpenShift

To perform a node reboot without disrupting applications on the platform, start by evacuating the pods. For pods that the routing tier keeps highly available, no additional steps are necessary. For other pods that require storage, such as databases, ensure that they can continue operating while one pod is temporarily offline. While the approach to resiliency for stateful pods differs between applications, in all cases it is important to configure the scheduler with node anti-affinity rules so that pods are spread evenly across the available nodes.
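For example, a stateful workload's replicas can be kept on separate nodes with a pod anti-affinity rule keyed on the node hostname. This is a minimal sketch; the `app=database` label and the image are hypothetical placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: database-replica
  labels:
    app: database          # hypothetical label shared by all replicas
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - database
        topologyKey: kubernetes.io/hostname   # at most one replica per node
  containers:
  - name: database
    image: registry.example.com/database:latest   # placeholder image
```

With the `required` form shown here a replica stays Pending rather than co-locating with another replica; use `preferredDuringSchedulingIgnoredDuringExecution` if you would rather allow co-location as a last resort.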

Handling nodes running critical infrastructure such as the router or registry follows the same evacuation process.

Nodes running critical infrastructure

Make sure there are at least three nodes available to operate router pods, registry pods, and monitoring pods before rebooting any nodes that house vital OpenShift Container Platform infrastructure components.
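Before rebooting, you can check how many nodes are available and where the infrastructure pods are running. This sketch assumes the default OpenShift 4 namespaces for the registry, router, and monitoring components; adjust for your cluster:

```shell
# Confirm at least three schedulable nodes can host infrastructure pods.
oc get nodes -l node-role.kubernetes.io/infra

# See which nodes the registry, router, and monitoring pods occupy.
oc -n openshift-image-registry get pods -o wide
oc -n openshift-ingress get pods -o wide
oc -n openshift-monitoring get pods -o wide
```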

How service interruptions can occur

With applications running on the OpenShift Container Platform, service interruptions can occur when only two nodes are available for infrastructure components:

  • Node A is marked unschedulable and all pods are evacuated.
  • The registry pod running on that node is now redeployed on node B. Node B runs both registry pods.
  • Node B is now marked unschedulable and is evacuated.

The service exposing the two pod endpoints on node B briefly loses all endpoints until they are redeployed to node A.

When three nodes are used for infrastructure components, this procedure does not cause a service disruption. However, because of pod scheduling, the last node that is evacuated and returned to rotation is left without a registry pod; one of the other nodes runs two registry pods instead. To schedule the third registry pod on the last node, use pod anti-affinity to prevent the scheduler from placing two registry pods on the same node.

Rebooting a node using pod anti-affinity

Pod anti-affinity is slightly different from node anti-affinity. Node anti-affinity can be violated if no other suitable location exists to deploy a pod. Pod anti-affinity can be set to either required or preferred.

With this setup, if only two infrastructure nodes are available and one is rebooted, the container image registry pod is prevented from running on the other node. oc get pods reports the pod as unready until a suitable node becomes available. Once a node is available and all pods are back in a ready state, the next node can be restarted.


Reboot a node using pod anti-affinity

To reboot a node using pod anti-affinity, follow the procedure below:

  1. Edit the node specification to configure pod anti-affinity:

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-antiaffinity
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: registry
              operator: In
              values:
              - default
          topologyKey: kubernetes.io/hostname


  • podAntiAffinity - Stanza to configure pod anti-affinity.
  • preferredDuringSchedulingIgnoredDuringExecution - Defines a preferred rule (as opposed to requiredDuringSchedulingIgnoredDuringExecution, which defines a required rule).
  • weight - Specifies a weight for a preferred rule. The node with the highest weight is preferred.
  • key - The pod label that determines when the anti-affinity rule applies. Specify a key and value for the label.
  • operator - Represents the relationship between the label on the existing pod and the set of values in the matchExpression parameters in the specification for the new pod. Can be In, NotIn, Exists, or DoesNotExist.

In this example, the container image registry pod has a label of registry=default. Pod anti-affinity can use any Kubernetes match expression.

  2. Enable the MatchInterPodAffinity scheduler predicate in the scheduling policy file.
  3. Perform a graceful restart of the node.
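The scheduler policy file referenced in step 2 is a JSON Policy object. A minimal sketch is shown below; the file location and any other predicates already configured vary by cluster, so merge rather than replace:

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "MatchInterPodAffinity"}
  ]
}
```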

Reboot nodes running routers

A pod running an OpenShift Container Platform router typically exposes a host port.

Because no two pods can bind the same host port, the PodFitsPorts scheduler predicate effectively establishes pod anti-affinity: no other router pod using the same port can run on the same node. If the routers rely on IP failover for high availability, nothing further is required.

When a router pod depends on an outside service for high availability, like AWS Elastic Load Balancing, that service is accountable for responding to router pod restarts.

On rare occasions, a router pod might not have a host port specified. In that case, follow the recommended restart procedure for infrastructure nodes.
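You can check whether your router pods bind a host port before deciding which procedure applies. This sketch assumes the default openshift-ingress namespace of OpenShift 4:

```shell
# List router (ingress) pods and the nodes they run on.
oc -n openshift-ingress get pods -o wide

# Print each router pod's host ports; an empty value after the colon
# means no host port is specified for that pod.
oc -n openshift-ingress get pods -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].ports[*].hostPort}{"\n"}{end}'
```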

Gracefully rebooting a node

Before rebooting a node, it is recommended to back up etcd data to avoid any data loss on the node.
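On OpenShift 4, etcd can be backed up with the cluster-backup.sh script shipped on control plane hosts. A sketch, reusing the article's example node name; the backup path is an example:

```shell
# Open a debug shell on a control plane node.
oc debug node/ocpmn01.ocpcl.suyi.local

# Inside the debug shell, switch to the host filesystem
# and run the bundled backup script.
chroot /host
/usr/local/bin/cluster-backup.sh /home/core/assets/backup
```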

On single-node OpenShift clusters that require users to run the oc login command rather than having the certificates in the kubeconfig file, the oc adm commands might not be available after the node is cordoned and drained, because the cordon prevents the openshift-oauth-apiserver pod from running. As shown in the steps below, SSH can be used to access the node instead.

On a single-node OpenShift cluster, cordoned and drained pods cannot be rescheduled. However, draining still gives the pods, particularly your workload pods, time to stop and release their resources.

Perform a graceful restart of a node

Follow the steps below to perform a graceful restart of a node:

  1. Mark the node as unschedulable: $ oc adm cordon ocpmn01.ocpcl.suyi.local
  2. Drain the node to remove all the running pods: $ oc adm drain ocpmn01.ocpcl.suyi.local --ignore-daemonsets --delete-emptydir-data --force
  3. If you receive errors that pods associated with custom pod disruption budgets (PDB) cannot be evicted, such as: error when evicting pods/"rails-postgresql-example-1-72v2w" -n "rails" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. Run the drain command again with the --disable-eviction flag, which bypasses the PDB checks: $ oc adm drain ocpmn01.ocpcl.suyi.local --ignore-daemonsets --delete-emptydir-data --force --disable-eviction
  4. Access the node in debug mode: $ oc debug node/ocpmn01.ocpcl.suyi.local
  5. Change your root directory to /host: $ chroot /host
  6. Restart the node: $ systemctl reboot
  7. The node enters the NotReady state.
  8. After the reboot is complete, mark the node as schedulable by running the following command: $ oc adm uncordon ocpmn01.ocpcl.suyi.local
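The PDB conflict in step 3 typically comes from an object like the following hypothetical one, whose minAvailable cannot be satisfied while its only pod is being evicted. The names and label match the example error message above but are otherwise illustrative:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: rails-pdb
  namespace: rails
spec:
  minAvailable: 1          # eviction refused if it would drop below this
  selector:
    matchLabels:
      app: rails-postgresql-example
```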

Note: With some single-node OpenShift clusters, the oc commands might not be available after you cordon and drain the node because the openshift-oauth-apiserver pod is not running. You can use SSH to connect to the node and perform the reboot. $ ssh core@ocpmn01.ocpcl.suyi.local

Verify that the node is ready

[ocpadmin@bastion suyi]$ oc get node ocpmn01.ocpcl.suyi.local

NAME                       STATUS   ROLES                  AGE   VERSION
ocpmn01.ocpcl.suyi.local   Ready    control-plane,master   21d   v1.25.14+20cda61
ocpwn01.ocpcl.suyi.local   Ready    app,worker             21d   v1.25.14+20cda61
ocpwn02.ocpcl.suyi.local   Ready    infra,worker           21d   v1.25.14+20cda61


