If the alert NodeClockNotSynchronising is firing in OpenShift Container Platform, you can troubleshoot the issue as follows. The alert is defined as:
alert: 'NodeClockNotSynchronising',
expr: |||
  min_over_time(node_timex_sync_status[5m]) == 0
||| % $._config,
'for': '10m',
labels: {
  severity: 'warning',
},
annotations: {
  summary: 'Clock not synchronizing.',
  message: 'Clock on {{ $labels.instance }} is not synchronizing. Ensure NTP is configured on this host.',
},
The clock in the pods is the same as on the host machine because it is controlled by the kernel. Verify that the host clock is kept in sync by chrony; if it is not, configure chrony as described below.
Configure chrony time service
Set the time server and related settings used by the chrony time service (chronyd) by modifying the contents of the chrony.conf file and passing those contents to your nodes as a machine config.
Create a Butane config including the contents of the chrony.conf file. For example, to configure chrony on worker nodes, create a 99-worker-chrony.bu file.
variant: openshift
version: 4.8.0
metadata:
  name: 99-worker-chrony
  labels:
    machineconfiguration.openshift.io/role: worker
storage:
  files:
  - path: /etc/chrony.conf
    mode: 0644
    overwrite: true
    contents:
      inline: |
        pool 0.rhel.pool.ntp.org iburst
        driftfile /var/lib/chrony/drift
        makestep 1.0 3
        rtcsync
        logdir /var/log/chrony
On control plane nodes, substitute master for worker in both of these locations. Specify an octal value for the mode field in the machine config file. After you create the file and apply the changes, the mode is converted to a decimal value (for example, 0644 becomes 420). You can check the YAML file with the command oc get mc <mc-name> -o yaml.
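If you need both variants, one quick way to derive the control plane Butane file from the worker one is a simple substitution. This is only a sketch, assuming the file names used above:
$ sed 's/worker/master/g' 99-worker-chrony.bu > 99-master-chrony.bu
$ butane 99-master-chrony.bu -o 99-master-chrony.yaml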
Specify any valid, reachable time source, such as the one provided by your DHCP server. You can specify any of the following NTP servers: 1.rhel.pool.ntp.org, 2.rhel.pool.ntp.org, or 3.rhel.pool.ntp.org.
Use Butane to generate a MachineConfig object file, 99-worker-chrony.yaml, containing the configuration to be delivered to the nodes:
butane 99-worker-chrony.bu -o 99-worker-chrony.yaml
Apply the configurations in one of two ways:
- If the cluster is not running yet, after you generate manifest files, add the MachineConfig object file to the <installation_directory>/openshift directory, and then continue to create the cluster.
- If the cluster is already running, apply the file:
oc apply -f ./99-worker-chrony.yaml
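Applying the MachineConfig causes the Machine Config Operator to roll the change out to the nodes in the pool, rebooting them one at a time. One way to follow the rollout, assuming the worker pool is the one being updated, is to wait until the pool reports UPDATED as True:
$ oc get mcp worker
$ oc get nodes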
Verify that chronyd inside the node is synced:
chronyc> sourcestats
210 Number of sources = 2
Name/IP Address            NP  NR  Span  Frequency  Freq Skew  Offset   Std Dev
================================================================================
domain.com                 35  17  139m     +0.001      0.023   +357ns     74us
domain1.com                 6   3   77m     +0.074      1.692  -1557us    796us
chronyc> activity
200 OK
To check the details of any individual server: chronyc ntpdata $IP_or_domain
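If you do not have an interactive shell on the node, the same chronyc checks can be run through a debug pod. A minimal sketch, assuming a node named worker-0:
$ oc debug node/worker-0 -- chroot /host chronyc tracking
$ oc debug node/worker-0 -- chroot /host chronyc sources -v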
Troubleshooting
To verify whether the cluster has been firing the alert, query Alertmanager:
$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`    ## In OCP client 4.10 or lower
OR
$ token=`oc create token prometheus-k8s -n openshift-monitoring`    ## In OCP client 4.11 or higher
$ curl -k -H "Authorization: Bearer $token" 'https://alertmanager-main-openshift-monitoring.apps.domain/api/v1/alerts' | jq '.data[].labels'
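To narrow the output to this specific alert, the same response can be filtered with jq. A sketch, assuming the Alertmanager route used above:
$ curl -k -H "Authorization: Bearer $token" 'https://alertmanager-main-openshift-monitoring.apps.domain/api/v1/alerts' | jq '.data[] | select(.labels.alertname=="NodeClockNotSynchronising") | .labels'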
node_exporter reports sync_status via the timex collector as the node_timex_sync_status metric, on which the alert is based. This mechanism relies on information reported by the kernel adjtimex syscall. To verify whether the cluster is indeed affected, query the following metrics:
$ curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s-openshift-monitoring.apps.domain.com/api/v1/query?query=node_timex_sync_status'
$ curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s-openshift-monitoring.apps.domain.com/api/v1/query?query=node_timex_maxerror_seconds'
$ curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s-openshift-monitoring.apps.domain.com/api/v1/query?query=node_timex_offset_seconds'
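For a quick per-node view, the JSON returned by the query API can be reduced to instance/value pairs. A sketch using the first query above; a node_timex_sync_status value of 0 means the node's clock is not synchronized:
$ curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s-openshift-monitoring.apps.domain.com/api/v1/query?query=node_timex_sync_status' | jq -r '.data.result[] | "\(.metric.instance) \(.value[1])"'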
It is also possible to get all the metrics as per the following example:
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://<node IP address>:9100/metrics'
The following information is also needed to understand whether there is any sync issue and how many servers and peers are connected:
$ systemctl status chronyd
$ journalctl -u chronyd
$ chronyc sources -v
$ chronyc sourcestats -v
$ chronyc tracking -v
$ chronyc -N sources -a
$ chronyc activity -v
$ chronyc ntpdata
$ chronyc clients
$ cat /etc/chrony.conf
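To collect this output from every node in one pass, a loop over debug pods can be used. This is only a sketch; adjust the list of commands as needed:
$ for node in $(oc get nodes -o name); do
    echo "== $node =="
    oc debug $node -- chroot /host sh -c 'chronyc tracking; chronyc sources -v; chronyc activity; cat /etc/chrony.conf'
  done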
Side note: NTP servers specified by their hostnames (instead of an IP address) have to have their names resolved before chronyd can send any requests to them. If the activity command prints a non-zero number of sources with unknown addresses, there is an issue with the resolution. The DNS server is specified in /etc/resolv.conf. If there is any problem reaching the NTP servers, gather a tcpdump while NTP peering is running so that it can be analyzed:
tcpdump -n -i any port 123 -vvv -w tcpdumpchrony.pcap
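As a quick check of the name-resolution issue mentioned above, you can confirm on the node that the configured pool names resolve. A sketch, assuming the pool from the chrony.conf example:
$ getent hosts 0.rhel.pool.ntp.org
$ cat /etc/resolv.conf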