Suyash Sambhare

Resolve etcdMembersDown

This alert fires when one or more etcd members are down; it evaluates the number of etcd members that are currently down. The alert is often observed during a cluster upgrade, when a master node is being upgraded and requires a reboot.
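
To see which etcd member is affected, you can list the etcd pods and check which one is not Running or not Ready. This is a general OpenShift check, not part of the original runbook:

oc get pods -n openshift-etcd -l app=etcd -o wide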

In etcd, a majority of (n/2)+1 members has to agree on membership changes or key-value update proposals. This approach avoids split-brain inconsistency. If only one member is down in a 3-member cluster, the cluster can still make forward progress, because the quorum is 2 and 2 members are still alive. However, when more members are down, the cluster loses quorum and becomes unrecoverable.
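
As a quick sketch of the quorum math (plain shell arithmetic, not a cluster command):

n=3                          # total number of etcd members
quorum=$(( n / 2 + 1 ))      # majority required for consensus
tolerance=$(( n - quorum ))  # members that can fail while quorum is kept
echo "members=$n quorum=$quorum tolerated_failures=$tolerance"

For n=3 this prints a quorum of 2 and 1 tolerated failure; for n=5 it prints a quorum of 3 and 2 tolerated failures.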

Investigation

Log in to the cluster and check the health of the master nodes to see whether any of them are in the NotReady state.

oc get nodes -l node-role.kubernetes.io/master=
NAME                        STATUS   ROLES                         AGE    VERSION
ocpmn01.ocpcl.suyi.local   Ready    control-plane,master,worker   184d   v1.25.14+20cda61
ocpmn02.ocpcl.suyi.local   Ready    control-plane,master,worker   184d   v1.25.14+20cda61
ocpmn03.ocpcl.suyi.local   Ready    control-plane,master,worker   184d   v1.25.14+20cda61

Check if an upgrade is in progress.

PS C:\Users\suyash.sambhare> oc adm upgrade
Cluster version is 4.12.40

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.12 (available channels: candidate-4.12, candidate-4.13, eus-4.12, eus-4.14, fast-4.12, fast-4.13, stable-4.12, stable-4.13)

Recommended updates:

  VERSION     IMAGE
  4.12.47     quay.io/openshift-release-dev/ocp-release@sha256:fcc9920ba10ebb02c69bdd9cd597273260eeec1b22e9ef9986a47f4874a21253
  4.12.46     quay.io/openshift-release-dev/ocp-release@sha256:2dda17736b7b747b463b040cb3b7abba9c4174b0922e2fd84127e3887f6d69c5
  4.12.45     quay.io/openshift-release-dev/ocp-release@sha256:faf0aebc0abce8890e046eecfa392c24bc24f6c49146c45447fb0977e692db6e
  4.12.44     quay.io/openshift-release-dev/ocp-release@sha256:304f37f9d7aa290252951751c5bf03a97085f77b4bcde0ed8a2fa455e9600e68
  4.12.43     quay.io/openshift-release-dev/ocp-release@sha256:10221b3f8f23fe625f3aab8f1e3297eaa340efc64fb5eff8d46cc8461888804e
  4.12.42     quay.io/openshift-release-dev/ocp-release@sha256:a52419c56d84f2953ddaa121e89a3d806902af4538503d8cf229b1e8a14f8902
  4.12.41     quay.io/openshift-release-dev/ocp-release@sha256:59c93fdfff4ecca2ca6d6bb0ec722bca2bb08152252ae10ce486a9fc80c82dcf
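
A shorter way to see whether an upgrade is rolling through the cluster is to check the ClusterVersion object; while an upgrade is in progress, the PROGRESSING column is True. This is a standard OpenShift command, not part of the original runbook:

oc get clusterversion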

Even when no upgrade is in progress, this alert can also be triggered if a change in the machine config for the master pool causes a rolling reboot of each master node. We can check whether the machineconfiguration.openshift.io/state : Working annotation is set on any of the master nodes; this is the case while the machine-config-operator (MCO) is working on that node.

oc get nodes -l node-role.kubernetes.io/master= -o template --template='{{range .items}}{{"===> node:> "}}{{.metadata.name}}{{"\n"}}{{range $k, $v := .metadata.annotations}}{{println $k ":" $v}}{{end}}{{"\n"}}{{end}}'

===> node:> ocpmn01.ocpcl.suyi.local
csi.volume.kubernetes.io/nodeid : {"csi.vsphere.vmware.com":"ocpmn01.ocpcl.suyi.local"}
k8s.ovn.org/host-addresses : ["10.10.11.21"]
k8s.ovn.org/l3-gateway-config : {"default":{"mode":"shared","interface-id":"br-ex_ocpmn01.ocpcl.suyi.local","mac-address":"00:50:56:86:47:57","ip-addresses":["10.10.11.21/24"],"ip-address":"10.10.11.21/24","next-hops":["10.10.11.1"],"next-hop":"10.10.11.1","node-port-enable":"true","vlan-id":"0"}}
k8s.ovn.org/node-chassis-id : 5363b9d6-2c92-4f3b-ba91-7d2a4a2ca173
k8s.ovn.org/node-gateway-router-lrp-ifaddr : {"ipv4":"100.64.0.2/16"}
k8s.ovn.org/node-mgmt-port-mac-address : 3e:1d:b5:6b:b8:73
k8s.ovn.org/node-primary-ifaddr : {"ipv4":"10.10.11.21/24"}
k8s.ovn.org/node-subnets : {"default":"10.130.0.0/23"}
machineconfiguration.openshift.io/controlPlaneTopology : HighlyAvailable
machineconfiguration.openshift.io/currentConfig : rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/desiredConfig : rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/desiredDrain : uncordon-rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/lastAppliedDrain : uncordon-rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/reason :
machineconfiguration.openshift.io/ssh : accessed
machineconfiguration.openshift.io/state : Done
volumes.kubernetes.io/controller-managed-attach-detach : true
===> node:> ocpmn02.ocpcl.suyi.local
csi.volume.kubernetes.io/nodeid : {"csi.vsphere.vmware.com":"ocpmn02.ocpcl.suyi.local"}
k8s.ovn.org/host-addresses : ["10.10.11.22"]
k8s.ovn.org/l3-gateway-config : {"default":{"mode":"shared","interface-id":"br-ex_ocpmn02.ocpcl.suyi.local","mac-address":"00:50:56:86:00:d9","ip-addresses":["10.10.11.22/24"],"ip-address":"10.10.11.22/24","next-hops":["10.10.11.1"],"next-hop":"10.10.11.1","node-port-enable":"true","vlan-id":"0"}}
k8s.ovn.org/node-chassis-id : 37537c28-71c6-4443-bdb0-8080e91cddc3
k8s.ovn.org/node-gateway-router-lrp-ifaddr : {"ipv4":"100.64.0.3/16"}
k8s.ovn.org/node-mgmt-port-mac-address : 36:47:4f:4d:cd:eb
k8s.ovn.org/node-primary-ifaddr : {"ipv4":"10.10.11.22/24"}
k8s.ovn.org/node-subnets : {"default":"10.128.0.0/23"}
machineconfiguration.openshift.io/controlPlaneTopology : HighlyAvailable
machineconfiguration.openshift.io/currentConfig : rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/desiredConfig : rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/desiredDrain : uncordon-rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/lastAppliedDrain : uncordon-rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/reason :
machineconfiguration.openshift.io/ssh : accessed
machineconfiguration.openshift.io/state : Done
volumes.kubernetes.io/controller-managed-attach-detach : true
===> node:> ocpmn03.ocpcl.suyi.local
csi.volume.kubernetes.io/nodeid : {"csi.vsphere.vmware.com":"ocpmn03.ocpcl.suyi.local"}
k8s.ovn.org/host-addresses : ["10.10.11.23"]
k8s.ovn.org/l3-gateway-config : {"default":{"mode":"shared","interface-id":"br-ex_ocpmn03.ocpcl.suyi.local","mac-address":"00:50:56:86:81:1a","ip-addresses":["10.10.11.23/24"],"ip-address":"10.10.11.23/24","next-hops":["10.10.11.1"],"next-hop":"10.10.11.1","node-port-enable":"true","vlan-id":"0"}}
k8s.ovn.org/node-chassis-id : 67e105a0-4bc2-452e-9e7d-e40c91131445
k8s.ovn.org/node-gateway-router-lrp-ifaddr : {"ipv4":"100.64.0.4/16"}
k8s.ovn.org/node-mgmt-port-mac-address : 4a:cc:35:4f:52:2d
k8s.ovn.org/node-primary-ifaddr : {"ipv4":"10.10.11.23/24"}
k8s.ovn.org/node-subnets : {"default":"10.129.0.0/23"}
machineconfiguration.openshift.io/controlPlaneTopology : HighlyAvailable
machineconfiguration.openshift.io/currentConfig : rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/desiredConfig : rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/desiredDrain : uncordon-rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/lastAppliedDrain : uncordon-rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/reason :
machineconfiguration.openshift.io/ssh : accessed
machineconfiguration.openshift.io/state : Done
volumes.kubernetes.io/controller-managed-attach-detach : true
[ocpadmin@bastion ~]$
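
You can also check the master MachineConfigPool directly. While the MCO is rolling out a new rendered config, UPDATING is True and READYMACHINECOUNT drops as each master node is cordoned and rebooted. This is a general check, not part of the original runbook:

oc get mcp master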

Members Down

Health check for etcd

To run etcdctl commands, we need to rsh into the etcdctl container of any etcd pod.

oc rsh -c etcdctl -n openshift-etcd $(oc get pod -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }')

Validate that the etcdctl command is available:

[ocpadmin@bastion ~]$ oc rsh -c etcdctl -n openshift-etcd $(oc get pod -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }')
sh-4.4# etcdctl version
etcdctl version: 3.5.9
API version: 3.5
sh-4.4#

Run the following command to get the health of etcd:

sh-4.4# etcdctl endpoint health -w table
+---------------------------+--------+-------------+-------+
|         ENDPOINT          | HEALTH |    TOOK     | ERROR |
+---------------------------+--------+-------------+-------+
| https://10.10.11.21:2379 |   true |  21.36776ms |       |
| https://10.10.11.22:2379 |   true | 16.876221ms |       |
| https://10.10.11.23:2379 |   true | 23.713666ms |       |
+---------------------------+--------+-------------+-------+
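
As additional checks (not shown in the original post), you can list the members and their status from the same etcdctl container; a down member will be missing or reported with an error:

etcdctl member list -w table
etcdctl endpoint status -w table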

Mitigation

If an upgrade is in progress, the alert may resolve on its own once the master node comes back up. If the MCO is not working on the master node, check the cloud provider to verify whether the master node instances are running.
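
A quick way to confirm recovery is to check the etcd cluster operator; once all members are healthy again, DEGRADED should return to False. This is a standard OpenShift check, not part of the original runbook:

oc get clusteroperator etcd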

If you are running on AWS, an AWS instance retirement may require a manual reboot of the master node.

Ref: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdMembersDown.md
