Resolve etcdMembersDown
This alert fires when one or more etcd members are down; it evaluates the number of etcd members that are currently down. The alert is often observed during a cluster upgrade, when a master node is being upgraded and requires a reboot.
In etcd, a majority of (n/2)+1 members has to agree on membership changes or key-value update proposals. This approach avoids split-brain inconsistency. A 3-member cluster with one member down can still make forward progress, because the quorum is 2 and 2 members are still alive. However, if more members are down, the cluster loses quorum and becomes unrecoverable.
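As a quick first check, list the etcd pods and verify they are all Running (the openshift-etcd namespace and the app=etcd label are the OpenShift defaults, also used further below):
oc get pods -n openshift-etcd -l app=etcd -o wide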
Investigation
Log in to the cluster and check the health of the master nodes: verify whether any of them are in the NotReady state.
oc get nodes -l node-role.kubernetes.io/master=
NAME STATUS ROLES AGE VERSION
ocpmn01.ocpcl.suyi.local Ready control-plane,master,worker 184d v1.25.14+20cda61
ocpmn02.ocpcl.suyi.local Ready control-plane,master,worker 184d v1.25.14+20cda61
ocpmn03.ocpcl.suyi.local Ready control-plane,master,worker 184d v1.25.14+20cda61
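If any master node shows NotReady, inspect its conditions and recent events for the reason. This is a generic check; substitute the name of the affected node:
oc describe node ocpmn01.ocpcl.suyi.local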
Check if an upgrade is in progress.
PS C:\Users\suyash.sambhare> oc adm upgrade
Cluster version is 4.12.40
Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.12 (available channels: candidate-4.12, candidate-4.13, eus-4.12, eus-4.14, fast-4.12, fast-4.13, stable-4.12, stable-4.13)
Recommended updates:
VERSION IMAGE
4.12.47 quay.io/openshift-release-dev/ocp-release@sha256:fcc9920ba10ebb02c69bdd9cd597273260eeec1b22e9ef9986a47f4874a21253
4.12.46 quay.io/openshift-release-dev/ocp-release@sha256:2dda17736b7b747b463b040cb3b7abba9c4174b0922e2fd84127e3887f6d69c5
4.12.45 quay.io/openshift-release-dev/ocp-release@sha256:faf0aebc0abce8890e046eecfa392c24bc24f6c49146c45447fb0977e692db6e
4.12.44 quay.io/openshift-release-dev/ocp-release@sha256:304f37f9d7aa290252951751c5bf03a97085f77b4bcde0ed8a2fa455e9600e68
4.12.43 quay.io/openshift-release-dev/ocp-release@sha256:10221b3f8f23fe625f3aab8f1e3297eaa340efc64fb5eff8d46cc8461888804e
4.12.42 quay.io/openshift-release-dev/ocp-release@sha256:a52419c56d84f2953ddaa121e89a3d806902af4538503d8cf229b1e8a14f8902
4.12.41 quay.io/openshift-release-dev/ocp-release@sha256:59c93fdfff4ecca2ca6d6bb0ec722bca2bb08152252ae10ce486a9fc80c82dcf
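When an upgrade is actually running, oc adm upgrade reports it as in progress. You can also query the Progressing condition of the ClusterVersion resource directly; this is a standard jsonpath query, not specific to this cluster:
oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Progressing")].message}{"\n"}'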
This alert can also be triggered when no upgrade is in progress but a change in the machineconfig for the master pool causes a rolling reboot of each master node. Check whether the machineconfiguration.openshift.io/state : Working annotation is set on any of the master nodes; this is the case when the machine-config-operator (MCO) is working on that node.
oc get nodes -l node-role.kubernetes.io/master= -o template --template='{{range .items}}{{"===> node:> "}}{{.metadata.name}}{{"\n"}}{{range $k, $v := .metadata.annotations}}{{println $k ":" $v}}{{end}}{{"\n"}}{{end}}'
===> node:> ocpmn01.ocpcl.suyi.local
csi.volume.kubernetes.io/nodeid : {"csi.vsphere.vmware.com":"ocpmn01.ocpcl.suyi.local"}
k8s.ovn.org/host-addresses : ["10.10.11.21"]
k8s.ovn.org/l3-gateway-config : {"default":{"mode":"shared","interface-id":"br-ex_ocpmn01.ocpcl.suyi.local","mac-address":"00:50:56:86:47:57","ip-addresses":["10.10.11.21/24"],"ip-address":"10.10.11.21/24","next-hops":["10.10.11.1"],"next-hop":"10.10.11.1","node-port-enable":"true","vlan-id":"0"}}
k8s.ovn.org/node-chassis-id : 5363b9d6-2c92-4f3b-ba91-7d2a4a2ca173
k8s.ovn.org/node-gateway-router-lrp-ifaddr : {"ipv4":"100.64.0.2/16"}
k8s.ovn.org/node-mgmt-port-mac-address : 3e:1d:b5:6b:b8:73
k8s.ovn.org/node-primary-ifaddr : {"ipv4":"10.10.11.21/24"}
k8s.ovn.org/node-subnets : {"default":"10.130.0.0/23"}
machineconfiguration.openshift.io/controlPlaneTopology : HighlyAvailable
machineconfiguration.openshift.io/currentConfig : rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/desiredConfig : rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/desiredDrain : uncordon-rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/lastAppliedDrain : uncordon-rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/reason :
machineconfiguration.openshift.io/ssh : accessed
machineconfiguration.openshift.io/state : Done
volumes.kubernetes.io/controller-managed-attach-detach : true
===> node:> ocpmn02.ocpcl.suyi.local
csi.volume.kubernetes.io/nodeid : {"csi.vsphere.vmware.com":"ocpmn02.ocpcl.suyi.local"}
k8s.ovn.org/host-addresses : ["10.10.11.22"]
k8s.ovn.org/l3-gateway-config : {"default":{"mode":"shared","interface-id":"br-ex_ocpmn02.ocpcl.suyi.local","mac-address":"00:50:56:86:00:d9","ip-addresses":["10.10.11.22/24"],"ip-address":"10.10.11.22/24","next-hops":["10.10.11.1"],"next-hop":"10.10.11.1","node-port-enable":"true","vlan-id":"0"}}
k8s.ovn.org/node-chassis-id : 37537c28-71c6-4443-bdb0-8080e91cddc3
k8s.ovn.org/node-gateway-router-lrp-ifaddr : {"ipv4":"100.64.0.3/16"}
k8s.ovn.org/node-mgmt-port-mac-address : 36:47:4f:4d:cd:eb
k8s.ovn.org/node-primary-ifaddr : {"ipv4":"10.10.11.22/24"}
k8s.ovn.org/node-subnets : {"default":"10.128.0.0/23"}
machineconfiguration.openshift.io/controlPlaneTopology : HighlyAvailable
machineconfiguration.openshift.io/currentConfig : rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/desiredConfig : rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/desiredDrain : uncordon-rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/lastAppliedDrain : uncordon-rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/reason :
machineconfiguration.openshift.io/ssh : accessed
machineconfiguration.openshift.io/state : Done
volumes.kubernetes.io/controller-managed-attach-detach : true
===> node:> ocpmn03.ocpcl.suyi.local
csi.volume.kubernetes.io/nodeid : {"csi.vsphere.vmware.com":"ocpmn03.ocpcl.suyi.local"}
k8s.ovn.org/host-addresses : ["10.10.11.23"]
k8s.ovn.org/l3-gateway-config : {"default":{"mode":"shared","interface-id":"br-ex_ocpmn03.ocpcl.suyi.local","mac-address":"00:50:56:86:81:1a","ip-addresses":["10.10.11.23/24"],"ip-address":"10.10.11.23/24","next-hops":["10.10.11.1"],"next-hop":"10.10.11.1","node-port-enable":"true","vlan-id":"0"}}
k8s.ovn.org/node-chassis-id : 67e105a0-4bc2-452e-9e7d-e40c91131445
k8s.ovn.org/node-gateway-router-lrp-ifaddr : {"ipv4":"100.64.0.4/16"}
k8s.ovn.org/node-mgmt-port-mac-address : 4a:cc:35:4f:52:2d
k8s.ovn.org/node-primary-ifaddr : {"ipv4":"10.10.11.23/24"}
k8s.ovn.org/node-subnets : {"default":"10.129.0.0/23"}
machineconfiguration.openshift.io/controlPlaneTopology : HighlyAvailable
machineconfiguration.openshift.io/currentConfig : rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/desiredConfig : rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/desiredDrain : uncordon-rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/lastAppliedDrain : uncordon-rendered-master-86ac83351a775a55fe067747839d8b41
machineconfiguration.openshift.io/reason :
machineconfiguration.openshift.io/ssh : accessed
machineconfiguration.openshift.io/state : Done
volumes.kubernetes.io/controller-managed-attach-detach : true
[ocpadmin@bastion ~]$
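Instead of dumping every annotation, you can pull just the MCO state per master node. A more compact sketch using custom-columns; the backslashes escape the dots inside the annotation key:
oc get nodes -l node-role.kubernetes.io/master= -o custom-columns='NAME:.metadata.name,MCO_STATE:.metadata.annotations.machineconfiguration\.openshift\.io/state'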
Health check for etcd
To run etcdctl commands, we need to rsh into the etcdctl container of any etcd pod.
oc rsh -c etcdctl -n openshift-etcd $(oc get pod -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }')
Validate that the etcdctl command is available:
[ocpadmin@bastion ~]$ oc rsh -c etcdctl -n openshift-etcd $(oc get pod -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }')
sh-4.4# etcdctl version
etcdctl version: 3.5.9
API version: 3.5
sh-4.4#
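If you prefer not to keep an interactive shell open, the same checks can be run one-off with oc exec against the same container:
oc exec -c etcdctl -n openshift-etcd $(oc get pod -l app=etcd -oname -n openshift-etcd | awk -F"/" 'NR==1{ print $2 }') -- etcdctl endpoint health -w table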
Run the following command to get the health of etcd:
sh-4.4# etcdctl endpoint health -w table
+---------------------------+--------+-------------+-------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+---------------------------+--------+-------------+-------+
| https://10.10.11.21:2379 | true | 21.36776ms | |
| https://10.10.11.22:2379 | true | 16.876221ms | |
| https://10.10.11.23:2379 | true | 23.713666ms | |
+---------------------------+--------+-------------+-------+
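Two more useful checks from the same shell: member list shows which members the cluster knows about (a missing or unstarted member points at the down member), and endpoint status shows the current leader and raft indexes:
sh-4.4# etcdctl member list -w table
sh-4.4# etcdctl endpoint status -w table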
Mitigation
If an upgrade is in progress, the alert may resolve automatically after some time, once the master node comes back up. If the MCO is not working on the master node, check the cloud provider to verify whether the master node instances are running. If you are running on AWS, an AWS instance retirement might require a manual reboot of the master node.
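On clusters managed by the machine-api (for example AWS or vSphere IPI installs, as assumed here), you can cross-check the state of the underlying machines from within the cluster before logging in to the cloud console:
oc get machines -n openshift-machine-api -o wide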
Ref: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdMembersDown.md