Debug School

Suyash Sambhare

Verifying the Health of a Cluster

Verifying the Health of OpenShift Nodes

The following commands display information about the status and health of nodes in an OpenShift cluster:
Run oc get nodes to display a STATUS column for each node. A node that is not Ready cannot communicate with the OpenShift control plane and is effectively dead to the cluster.

PS C:\Users\suyash.sambhare> oc get nodes
NAME                        STATUS   ROLES                         AGE    VERSION
ocpmn01.ocpcl.suyi.local   Ready    control-plane,master,worker   108d   v1.25.8+37a9a08
ocpmn02.ocpcl.suyi.local   Ready    control-plane,master,worker   108d   v1.25.8+37a9a08
ocpmn03.ocpcl.suyi.local   Ready    control-plane,master,worker   108d   v1.25.8+37a9a08
ocpwn01.ocpcl.suyi.local   Ready    app,worker                    107d   v1.25.8+37a9a08
ocpwn02.ocpcl.suyi.local   Ready    app,worker                    107d   v1.25.8+37a9a08
ocpwn03.ocpcl.suyi.local   Ready    infra,worker                  106d   v1.25.8+37a9a08
ocpwn04.ocpcl.suyi.local   Ready    infra,worker                  106d   v1.25.8+37a9a08

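As a rough illustration of automating this check, the sketch below scans the text printed by oc get nodes and returns the nodes whose STATUS is not Ready. The function name and the NotReady row in the sample are hypothetical and added only for demonstration; the parsing assumes the default column layout shown above.

```python
from __future__ import annotations


def nodes_not_ready(oc_get_nodes_output: str) -> list[str]:
    """Return the names of nodes whose STATUS column does not include 'Ready'."""
    lines = [ln for ln in oc_get_nodes_output.splitlines() if ln.strip()]
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    not_ready = []
    for line in lines[1:]:
        cols = line.split()
        # STATUS can be compound, e.g. "Ready,SchedulingDisabled" on a cordoned node.
        if "Ready" not in cols[status_idx].split(","):
            not_ready.append(cols[name_idx])
    return not_ready


# Sample text: the first row is from the output above, the second is hypothetical.
SAMPLE = """\
NAME                        STATUS     ROLES                         AGE    VERSION
ocpmn01.ocpcl.suyi.local   Ready      control-plane,master,worker   108d   v1.25.8+37a9a08
ocpwn01.ocpcl.suyi.local   NotReady   app,worker                    107d   v1.25.8+37a9a08
"""

print(nodes_not_ready(SAMPLE))
```

A cordoned node still counts as Ready here, since the sketch only looks for the Ready keyword in the STATUS column.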
Run oc adm top nodes to display the current CPU and memory usage of each node. These are actual usage numbers, not the resource requests that the OpenShift scheduler uses to compute the available and used capacity of the node.

PS C:\Users\suyash.sambhare> oc adm top nodes
NAME                        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ocpmn01.ocpcl.suyi.local   3646m        23%    30679Mi         65%
ocpmn02.ocpcl.suyi.local   4984m        32%    32720Mi         69%
ocpmn03.ocpcl.suyi.local   6117m        39%    36812Mi         78%
ocpwn01.ocpcl.suyi.local   983m         13%    13249Mi         42%
ocpwn02.ocpcl.suyi.local   1786m        23%    16819Mi         54%
ocpwn03.ocpcl.suyi.local   1073m        3%     21016Mi         44%
ocpwn04.ocpcl.suyi.local   1969m        6%     20950Mi         44%

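The usage table can also be filtered programmatically. The following is a minimal sketch, assuming the five-column layout shown above; the function name and the 75 percent threshold are illustrative choices, not OpenShift defaults.

```python
from __future__ import annotations


def nodes_over_threshold(top_output: str, column: str = "MEMORY%", threshold: int = 75) -> dict[str, int]:
    """Map node name to percentage for nodes at or above the threshold in the given column."""
    lines = [ln for ln in top_output.splitlines() if ln.strip()]
    header = lines[0].split()
    col = header.index(column)
    flagged = {}
    for line in lines[1:]:
        cols = line.split()
        value = cols[col].rstrip("%")
        if not value.isdigit():
            # Skip rows without a numeric percentage (for example, missing metrics).
            continue
        if int(value) >= threshold:
            flagged[cols[0]] = int(value)
    return flagged


# Three rows copied from the output above.
TOP_SAMPLE = """\
NAME                        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ocpmn02.ocpcl.suyi.local   4984m        32%    32720Mi         69%
ocpmn03.ocpcl.suyi.local   6117m        39%    36812Mi         78%
ocpwn01.ocpcl.suyi.local   983m         13%    13249Mi         42%
"""

print(nodes_over_threshold(TOP_SAMPLE))
print(nodes_over_threshold(TOP_SAMPLE, column="CPU%", threshold=30))
```

With these rows, only ocpmn03 crosses the memory threshold, while both control plane nodes exceed 30 percent CPU.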
Run oc describe node followed by a node name to display the resources available and used from the scheduler's point of view, along with other details. Look for the headings "Capacity", "Allocatable", and "Allocated resources" in the output. The "Conditions" heading indicates whether the node is under memory pressure, disk pressure, or some other condition that would prevent the node from starting new containers.

PS C:\Users\suyash.sambhare> oc describe node ocpmn01.ocpcl.suyi.local
Name:               ocpmn01.ocpcl.suyi.local
Roles:              control-plane,master,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ocpmn01.ocpcl.suyi.local
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node-role.kubernetes.io/master=
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        csi.volume.kubernetes.io/nodeid: {"csi.vmware.com":"ocpmn01.ocpcl.suyi.local"}
                    k8s.ovn.org/host-addresses: ["196.0.11.21"]
                    k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_ocpmn01.ocpcl.suyi.local","mac-address":"00:50:56:86:47:57","ip-addresses":["196.0.11....
                    k8s.ovn.org/node-chassis-id: 5363b9d6-2c92-4f3b-ba91-7d2a4a2ca173
                    k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.2/16"}
                    k8s.ovn.org/node-mgmt-port-mac-address: 3e:1d:b5:6b:b8:73
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"196.0.11.21/24"}
                    k8s.ovn.org/node-subnets: {"default":"10.127.0.0/23"}
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-master-45d666cdd4fb509173ccaa091b8ca304
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-45d666cdd4fb509173ccaa091b8ca304
                    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-master-45d666cdd4fb509173ccaa091b8ca304
                    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-master-45d666cdd4fb509173ccaa091b8ca304
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/ssh: accessed
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 25 Jul 2023 11:53:54 +0530
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ocpmn01.ocpcl.suyi.local
  AcquireTime:     <unset>
  RenewTime:       Fri, 10 Nov 2023 12:38:37 +0530
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Fri, 10 Nov 2023 12:36:02 +0530   Tue, 25 Jul 2023 19:28:04 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 10 Nov 2023 12:36:02 +0530   Tue, 25 Jul 2023 19:28:04 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Fri, 10 Nov 2023 12:36:02 +0530   Tue, 25 Jul 2023 19:28:04 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Fri, 10 Nov 2023 12:36:02 +0530   Tue, 25 Jul 2023 19:28:14 +0530   KubeletReady                 kubelet is posting ready status
Addresses:
  ExternalIP:  196.0.11.21
  InternalIP:  196.0.11.21
  Hostname:    ocpmn01.ocpcl.suyi.local
Capacity:
  cpu:                16
  ephemeral-storage:  261608428Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             49430728Ki
  pods:               250
Allocatable:
  cpu:                15500m
  ephemeral-storage:  240024585022
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             48279752Ki
  pods:               250
System Info:
  Machine ID:                             afd9cfb931fb4eb1817cfb87581de93e
  System UUID:                            b9dd0642-a9e0-a2ee-6712-a4af714eee87
  Boot ID:                                17047a4f-1ccc-47aa-9ad1-2ec1429bdc2c
  Kernel Version:                         4.18.0-372.53.1.el8_6.x86_64
  OS Image:                               Red Hat Enterprise Linux CoreOS 412.86.202305080640-0 (Ootpa)
  Operating System:                       linux
  Architecture:                           amd64
  Container Runtime Version:              cri-o://1.25.3-2.rhaos4.12.git592efcd.el8
  Kubelet Version:                        v1.25.8+37a9a08
  Kube-Proxy Version:                     v1.25.8+37a9a08
ProviderID:                               vsphere://4206ddb9-e0a9-eea2-6712-a4af714eee87
Non-terminated Pods:                      (45 in total)
  Namespace                               Name                                                        CPU Requests  CPU Limits   Memory Requests  Memory Limits  Age
  ---------                               ----                                                        ------------  ----------   ---------------  -------------  ---
  stackrox                                admission-control-748db4d84c-gvbfw                          50m (0%)      500m (3%)    100Mi (0%)       500Mi (1%)     17d
  stackrox                                central-db-688b744fb-pdg2z                                  4 (25%)       8 (51%)      8Gi (17%)        16Gi (34%)     17d
  stackrox                                collector-7x8v9                                             70m (0%)      2750m (17%)  340Mi (0%)       3572Mi (7%)    17d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                12989m (83%)   22950m (148%)
  memory             42540Mi (90%)  47592Mi (100%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
Events:              <none>

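To see how the figures under "Capacity", "Allocatable", and "Allocated resources" relate, the arithmetic can be reproduced by hand. The sketch below assumes that each percentage is the request or limit divided by the Allocatable value and then truncated to a whole number; that assumption matches the numbers in the output above. The helper names are hypothetical.

```python
def to_millicores(quantity: str) -> float:
    """Convert a CPU quantity such as '16' or '15500m' to millicores."""
    if quantity.endswith("m"):
        return float(quantity[:-1])
    return float(quantity) * 1000


def to_mi(quantity: str) -> float:
    """Convert a memory quantity with a Ki, Mi, or Gi suffix to MiB."""
    factors = {"Ki": 1 / 1024, "Mi": 1.0, "Gi": 1024.0}
    return float(quantity[:-2]) * factors[quantity[-2:]]


def percent_of(used: float, allocatable: float) -> int:
    """Percentage of allocatable, truncated (assumed to match the describe output)."""
    return int(used / allocatable * 100)


# Values copied from the oc describe node output above.
cpu_capacity = to_millicores("16")
cpu_allocatable = to_millicores("15500m")
mem_capacity = to_mi("49430728Ki")
mem_allocatable = to_mi("48279752Ki")

# Capacity minus Allocatable is what the node holds back for itself.
cpu_reserved = cpu_capacity - cpu_allocatable
mem_reserved = mem_capacity - mem_allocatable

cpu_requests_pct = percent_of(to_millicores("12989m"), cpu_allocatable)
cpu_limits_pct = percent_of(to_millicores("22950m"), cpu_allocatable)
mem_requests_pct = percent_of(to_mi("42540Mi"), mem_allocatable)
mem_limits_pct = percent_of(to_mi("47592Mi"), mem_allocatable)

print(cpu_reserved, mem_reserved)
print(cpu_requests_pct, cpu_limits_pct, mem_requests_pct, mem_limits_pct)
```

The CPU limits figure above 100 percent is the overcommitment that the "Allocated resources" note refers to: limits may add up to more than the node can provide, while requests may not exceed Allocatable.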
Reviewing the Cluster Version Resource

The OpenShift installer creates an auth directory that contains the kubeconfig and kubeadmin-password files.
Run the oc login command to connect to the cluster as the kubeadmin user.
The password of the kubeadmin user is in the kubeadmin-password file. The session below authenticates with an API token instead of the password.

[user@host ~]$ oc login --token=sha256~hBL_ZaY9adNxmd9-NuHtu6H0-qyLOct_arrnqdsOW7o --server=https://api.ocpcl.suyi.local:6443
WARNING: Using insecure TLS client config. Setting this option is not supported!

Logged into "https://api.ocpcl.suyi.local:6443" as "kube:admin" using the token provided.

You have access to 12 projects, the list has been suppressed. You can list all projects with 'oc projects'

Using project "default".

Cluster Version

ClusterVersion is a custom resource that holds high-level information about the cluster, such as the update channels, the status of the cluster operators, and the cluster version (for example, 4.10.3). Use this resource to declare the version of the cluster you want to run. Defining a new version for the cluster instructs the cluster-version operator to upgrade the cluster to that version.
You can retrieve the cluster version to verify that it is running the desired version, and also to ensure that the cluster uses the right subscription channel.
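Run oc get clusterversion to retrieve the cluster version. The output lists the version, including minor releases, the cluster uptime for a given version, and the overall status of the cluster. In the example below, the cluster still reports 4.12.17 because the update to 4.12.40 is blocked by the unavailable storage cluster operator.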

PS C:\Users\suyash.sambhare> oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.17   True        True          8d      Unable to apply 4.12.40: the cluster operator storage is not available

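For scripted checks, the single row printed by oc get clusterversion can be split into named fields. This is a minimal sketch that assumes the six-column header shown above; the function name is hypothetical, and the STATUS column is treated as free text because it contains spaces.

```python
from __future__ import annotations


def parse_clusterversion(output: str) -> dict[str, str]:
    """Split the header and the first data row of 'oc get clusterversion' into a dict."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    header = lines[0].split()
    # Limit the split so that the free-text STATUS column stays in one piece.
    values = lines[1].split(None, len(header) - 1)
    return dict(zip(header, values))


# Output copied from the session above.
CV_SAMPLE = """\
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.17   True        True          8d      Unable to apply 4.12.40: the cluster operator storage is not available
"""

row = parse_clusterversion(CV_SAMPLE)
if row["PROGRESSING"] == "True":
    print(f"Cluster at {row['VERSION']} is mid-update: {row['STATUS']}")
```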
Run oc describe clusterversion to obtain more detailed information about the cluster status.

PS C:\Users\suyash.sambhare> oc describe clusterversion
Name:         version
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterVersion
Metadata:
  Creation Timestamp:  2023-07-25T06:06:10Z
  Generation:          6
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:channel:
        f:clusterID:
    Manager:      cluster-bootstrap
    Operation:    Update
    Time:         2023-07-25T06:06:10Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:desiredUpdate:
          .:
          f:image:
          f:version:
    Manager:      Mozilla
    Operation:    Update
    Time:         2023-11-01T08:57:33Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:availableUpdates:
        f:capabilities:
          .:
          f:enabledCapabilities:
          f:knownCapabilities:
        f:conditions:
        f:desired:
          .:
          f:image:
          f:url:
          f:version:
        f:history:
        f:observedGeneration:
        f:versionHash:
    Manager:         cluster-version-operator
    Operation:       Update
    Subresource:     status
    Time:            2023-11-10T06:58:06Z
  Resource Version:  241224594
  UID:               56a916ef-1f62-4b12-b04e-574675d1089c
Spec:
  Channel:     stable-4.12
  Cluster ID:  18bc54
  Desired Update:
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:b0b1aac82f9083d20e7e4269b05dd3679299d277d122fa9d29b772f38d2cacff
    Version:  4.12.40
Status:
  Available Updates:  <nil>
  Capabilities:
    Enabled Capabilities:
      CSISnapshot
      Console
      Insights
      Storage
      baremetal
      marketplace
      openshift-samples
    Known Capabilities:
      CSISnapshot
      Console
      Insights
      Storage
      baremetal
      marketplace
      openshift-samples
  Conditions:
    Last Transition Time:  2023-07-25T06:06:14Z
    Message:               Kubernetes 1.26 and therefore OpenShift 4.13 remove several APIs that require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6958394 for details and instructions.
    Reason:                AdminAckRequired
    Status:                False
    Type:                  Upgradeable
    Last Transition Time:  2023-07-25T06:06:14Z
    Message:               Capabilities match configured spec
    Reason:                AsExpected
    Status:                False
    Type:                  ImplicitlyEnabledCapabilities
    Last Transition Time:  2023-07-25T06:06:14Z
    Message:               Payload loaded version="4.12.40" image="quay.io/openshift-release-dev/ocp-release@sha256:b0b1aac82f9083d20e7e4269b05dd3679299d277d122fa9d29b772f38d2cacff" architecture="amd64"
    Reason:                PayloadLoaded
    Status:                True
    Type:                  ReleaseAccepted
    Last Transition Time:  2023-07-25T10:21:40Z
    Message:               Done applying 4.12.17
    Status:                True
    Type:                  Available
    Last Transition Time:  2023-11-01T09:33:14Z
    Message:               Cluster operator storage is not available
    Reason:                ClusterOperatorNotAvailable
    Status:                True
    Type:                  Failing
    Last Transition Time:  2023-11-01T08:58:02Z
    Message:               Unable to apply 4.12.40: the cluster operator storage is not available
    Reason:                ClusterOperatorNotAvailable
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2023-11-10T06:58:06Z
    Message:               Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.12&id=818bc54&version=4.12.40": dial tcp 34.239.99.247:443: connect: connection timed out
    Reason:                RemoteFailed
    Status:                False
    Type:                  RetrievedUpdates
  Desired:
    Image:    quay.io/openshift-release-dev/ocp-release@sha256:b0b1aac82f9083d20e7e4269b05dd3679299d277d122fa9d29b772f38d2cacff
    URL:      https://access.redhat.com/errata/RHSA-2023:5896
    Version:  4.12.40
  History:
    Completion Time:    <nil>
    Image:              quay.io/openshift-release-dev/ocp-release@sha256:b0b1aac82f9083d20e7e4269b05dd3679299d277d122fa9d29b772f38d2cacff
    Started Time:       2023-11-01T08:58:02Z
    State:              Partial
    Verified:           true
    Version:            4.12.40
    Completion Time:    2023-07-25T10:21:40Z
    Image:              quay.io/openshift-release-dev/ocp-release@sha256:7ca5f8aa44bbc537c5a985a523d87365eab3f6e72abc50b7be4caae741e093f4
    Started Time:       2023-07-25T06:06:14Z
    State:              Completed
    Verified:           false
    Version:            4.12.17
  Observed Generation:  6
  Version Hash:         hGErDPikQok=
Events:                 <none>

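The Conditions list in the describe output is the quickest place to spot what is wrong. The sketch below compares each condition with the status one might expect on a settled cluster and returns the ones that differ. The expected values are an assumption made for illustration, not an official definition of health, and the type and status keys mirror the fields of the same conditions when the resource is requested as JSON.

```python
from __future__ import annotations

# Assumed statuses for a settled cluster (an illustration, not an official rule).
EXPECTED = {
    "Available": "True",
    "Failing": "False",
    "Progressing": "False",
    "Upgradeable": "True",
    "RetrievedUpdates": "True",
}


def unexpected_conditions(conditions: list[dict[str, str]]) -> list[str]:
    """Return condition types whose status differs from EXPECTED, in input order."""
    return [
        c["type"]
        for c in conditions
        if c["type"] in EXPECTED and c["status"] != EXPECTED[c["type"]]
    ]


# Type, status, and reason values copied from the describe output above.
CONDITIONS = [
    {"type": "Upgradeable", "status": "False", "reason": "AdminAckRequired"},
    {"type": "Available", "status": "True", "reason": ""},
    {"type": "Failing", "status": "True", "reason": "ClusterOperatorNotAvailable"},
    {"type": "Progressing", "status": "True", "reason": "ClusterOperatorNotAvailable"},
    {"type": "RetrievedUpdates", "status": "False", "reason": "RemoteFailed"},
]

print(unexpected_conditions(CONDITIONS))
```

For this cluster, every condition except Available is flagged, which matches the stalled update to 4.12.40 and the failed connection to the update service.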