Fix invalid checkpoint record in Postgres

#devops #postgres #data #kubernetes

When PostgreSQL is unable to detect a valid checkpoint from which to begin the recovery procedure, it displays the error "PANIC: could not locate a valid checkpoint record." This may occur if the PostgreSQL container is not restarted safely.

Prevent data corruption

For stateful applications like databases, the first method is to use Stateful Sets rather than Deployments; this is a Kubernetes best practice. To preserve the stability and integrity of your database, Stateful Sets offer assurances regarding the ordering and uniqueness of pods. Additionally, stateful sets will manage restarts more effectively.

The second method is to safely shut down your PostgreSQL instance by using a preStop hook in Kubernetes.
Before a pod is terminated in Kubernetes, the preStop hook is invoked. Once the preStop hook is finished, the pod's termination process starts.

Apply the preStop hook in a Kubernetes configuration for PostgreSQL:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  replicas: 1
  serviceName: "postgres"
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:latest
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: postgres-data
          mountPath: /var/lib/postgresql/data
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "pg_ctl -D /var/lib/postgres/data -w -t 60 -m fast stop"]
  volumeClaimTemplates:
  - metadata:
      name: postgres-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

This preStop hook runs the pg_ctl stop command to shut down the PostgreSQL server.

-D /var/lib/postgres/data option specifies the directory where the data base files live,
-w waits until the server shuts down,
-t 60 specifies the wait timeout in seconds, and
-m fast means to do a "fast" shutdown, which rolls back all active transactions, disconnects clients immediately and shuts down.

For AWS ECS deployments you can set the Min and max running tasks respectively: 0% min and 100% max

Recover from an error state

This can be fixed by running pg_resetwal command by connecting to the container.
However, since the pod is in crashLoopBackOff state, we will not be able to connect to the container.
We must first make the pod to be in stable state to execute the command.

Please follow the steps below for manual execution of the command pg_resetwal in a container.
This process requires the pod to be in a stable state, and not in a crash loop.
You can achieve this by introducing a delay in the postgres-deployment.yaml file.

Scale down the deployment by reducing the replica count of the deployment to zero using the following command: kubectl scale deployment postgres - replicas=0
Introduce a delay: Modify the postgres-deployment.yaml file to include a sleep command that delays the initialization process by 600 seconds, keeping the pod in the initializing state.

Here is an example of how to add it:

spec:
      containers:
      - name: postgres
        image: postgres:latest
        command: [ "/bin/bash", "-c", " - " ] 
        args: [ "sleep 600;" ]

Scale up the deployment: Bring the deployment back up by increasing the replica count to one using the following command: kubectl scale deployment postgres - replicas=1
Execute pg_resetwal in the pod: Once the pod reaches the initializing state, run the pg_resetwal command by executing into the pod. Here is how you can do this:

kubectl exec -it postgres - /bin/bash
su postgres
pg_resetwal /var/lib/postgres/data

Proceed despite the warning: If you do receive the warning, continue the process by forcing the pg_resetwal command as follows: pg_resetwal /var/lib/posgtgres/data -f

After running this command, you should receive a confirmation message stating “Write-ahead log reset.”
After you have successfully executed the pg_resetwal command, you should be able to restart the PostgreSQL server.
Remember that the pg_resetwal command is a measure of last resort and carries the risk of data loss or inconsistency.
Always make sure to maintain regular backups of your PostgreSQL databases and consider setting up high availability and replication solutions for your production databases.

Ref: https://medium.com/@adnanitdev/fix-for-error-panic-could-not-locate-a-valid-checkpoint-record-in-postgres-or-citus-running-in-b03d8341a258

Debug School

Fix invalid checkpoint record in Postgres

Prevent data corruption

Recover from an error state

Top comments (0)