Suyash Sambhare
Node Clock Not Synchronising in OCP

If the alert NodeClockNotSynchronising is firing in OpenShift Container Platform, we can troubleshoot the issue. The alert is defined as follows:

               alert: 'NodeClockNotSynchronising',
               expr: |||
                 min_over_time(node_timex_sync_status[5m]) == 0
               ||| % $._config,
               'for': '10m',
               labels: {
                 severity: 'warning',
               },
               annotations: {
                 summary: 'Clock not synchronizing.',
                 message: 'Clock on {{ $labels.instance }} is not synchronizing. Ensure NTP is configured on this host.',
               },

The clock inside pods is the same as on the host machine because it is controlled by the kernel. Verify that the host clock is kept in sync by chrony; if it is not, configure the chrony time service as described below.
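For example, a minimal way to check the clock state on a specific node is to run chronyc through a debug pod (the node name below is a placeholder):

    # Check time synchronization on a node (replace <node-name>)
    $ oc debug node/<node-name> -- chroot /host chronyc tracking
    # "Leap status : Normal" and a small System time offset indicate the clock is in sync;
    # "Not synchronised" matches the condition that fires the alert.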

Configure chrony time service

Set the time server and related settings used by the chrony time service (chronyd) by modifying the contents of the chrony.conf file and passing those contents to your nodes as a machine config.
Create a Butane config including the contents of the chrony.conf file. For example, to configure chrony on worker nodes, create a 99-worker-chrony.bu file.

        variant: openshift
        version: 4.8.0
        metadata:
          name: 99-worker-chrony 
          labels:
            machineconfiguration.openshift.io/role: worker 
        storage:
          files:
          - path: /etc/chrony.conf
            mode: 0644 
            overwrite: true
            contents:
              inline: |
                pool 0.rhel.pool.ntp.org iburst 
                driftfile /var/lib/chrony/drift
                makestep 1.0 3
                rtcsync
                logdir /var/log/chrony

On control plane nodes, substitute master for worker in both the metadata name and the machineconfiguration.openshift.io/role label. Specify the mode field as an octal value in the machine config file. After creating the file and applying the changes, the mode is converted to a decimal value. You can check the resulting YAML with the command oc get mc <mc-name> -o yaml.
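As an illustration of that conversion, an abridged oc get mc 99-worker-chrony -o yaml output would look roughly like the sketch below; exact fields can differ by version, but the octal mode 0644 appears as its decimal equivalent 420 and the file contents appear as a data URL:

    $ oc get mc 99-worker-chrony -o yaml
    ...
    spec:
      config:
        storage:
          files:
          - contents:
              source: data:,<encoded chrony.conf contents>
            mode: 420
            overwrite: true
            path: /etc/chrony.conf
    ...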
Specify any valid, reachable time source, such as the one provided by your DHCP server. You can specify any of the following NTP servers: 1.rhel.pool.ntp.org, 2.rhel.pool.ntp.org, or 3.rhel.pool.ntp.org.
Use Butane to generate a MachineConfig object file, 99-worker-chrony.yaml, containing the configuration to be delivered to the nodes: butane 99-worker-chrony.bu -o 99-worker-chrony.yaml
Apply the configurations in one of two ways:

  • If the cluster is not running yet, add the MachineConfig object file to the <installation_directory>/openshift directory after you generate the manifest files, and then continue to create the cluster.
  • If the cluster is already running, apply the file: oc apply -f ./99-worker-chrony.yaml (a rollout verification sketch follows the output below). Then verify that chronyd inside the node is synced:
    chronyc> sourcestats
    210 Number of sources = 2
    Name/IP Address            NP  NR  Span  Frequency  Freq Skew  Offset  Std Dev
    ==============================================================================
    domain.com                 35  17  139m     +0.001      0.023   +357ns    74us
    domain1.com                 6   3   77m     +0.074      1.692  -1557us   796us
    chronyc> activity
    200 OK
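Note that on a running cluster the MachineConfig is rolled out by the Machine Config Operator, which drains and reboots the affected nodes one at a time. A minimal sketch for following the rollout and confirming that the file landed on a node (the node name is a placeholder):

    # Watch the worker pool until UPDATED reports True
    $ oc get mcp worker -w
    # Confirm the chrony configuration was written on a node (replace <node-name>)
    $ oc debug node/<node-name> -- chroot /host cat /etc/chrony.conf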

To check the NTP data for any individual server: chronyc ntpdata $IP_or_domain


Troubleshooting

To verify whether the cluster has been receiving the alert, query the Alertmanager API:

    ## In OCP client 4.10 or lower:
    $ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`

    ## In OCP client 4.11 or higher:
    $ token=`oc create token prometheus-k8s -n openshift-monitoring`

    $ curl -k -H "Authorization: Bearer $token" 'https://alertmanager-main-openshift-monitoring.apps.domain/api/v1/alerts' | jq '.data[].labels'
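To narrow that output down to this particular alert, the same response can be filtered by alert name with jq, for example (assuming the same route and token as above):

    $ curl -k -H "Authorization: Bearer $token" 'https://alertmanager-main-openshift-monitoring.apps.domain/api/v1/alerts' \
        | jq '.data[] | select(.labels.alertname == "NodeClockNotSynchronising")'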

node_exporter reports the sync status via its timex collector as the node_timex_sync_status metric, on which the alert is based. This mechanism relies on information reported by the kernel's adjtimex syscall. To verify whether the cluster is indeed affected, query the following metrics:

       $ curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s-openshift-monitoring.apps.domain.com/api/v1/query?query=node_timex_sync_status'
       $ curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s-openshift-monitoring.apps.domain.com/api/v1/query?query=node_timex_maxerror_seconds'
       $ curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s-openshift-monitoring.apps.domain.com/api/v1/query?query=node_timex_offset_seconds'
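As a quick way to read the first query, the instances that are currently reporting an unsynchronised clock (metric value 0) can be listed with jq, for example:

    $ curl -sk -H "Authorization: Bearer $token" \
        'https://prometheus-k8s-openshift-monitoring.apps.domain.com/api/v1/query?query=node_timex_sync_status' \
        | jq -r '.data.result[] | select(.value[1] == "0") | .metric.instance'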

It is also possible to retrieve all the metrics directly from a node's node_exporter endpoint, as in the following example: oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://<node IP address>:9100/metrics'
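Because that endpoint returns every node_exporter metric, it can help to filter for the clock-related ones, for example:

    $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
        curl -sk -H "Authorization: Bearer $token" 'https://<node IP address>:9100/metrics' | grep '^node_timex_'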

The following information is also needed to understand whether there is a sync issue and how many servers and peers are connected:

    $ systemctl status chronyd
    $ journalctl -u chronyd
    $ chronyc sources -v
    $ chronyc sourcestats -v
    $ chronyc tracking -v
    $ chronyc -N sources -a
    $ chronyc activity -v
    $ chronyc ntpdata
    $ chronyc clients 
    $ cat /etc/chrony.conf 

Side note: NTP servers specified by hostname (instead of an IP address) must have their names resolved before chronyd can send any requests to them. If the activity command prints a non-zero number of sources with unknown addresses, there is an issue with name resolution. The DNS server is specified in /etc/resolv.conf. If there is any problem reaching the NTP servers, gather a tcpdump while chronyd is polling them so the traffic can be analyzed:
tcpdump -n -i any port 123 -vvvvv -w tcpdumpchrony.pcap
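A minimal check for the name-resolution case, using the pool name from the example configuration above, is:

    # Confirm the configured NTP server name resolves on the node
    $ getent hosts 0.rhel.pool.ntp.org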

Ref: https://access.redhat.com/solutions/6257001
