
PostgreSQL pod fails to restart properly if DB initialization didn't finish properly in previous pod run #309

Closed
iankko opened this issue Feb 4, 2019 · 15 comments


iankko commented Feb 4, 2019

When the OpenShift cluster is under higher load (hundreds of tests running sequentially, 3-5 tests running in parallel), the PostgreSQL pod can take a longer time to start. If the database initialization started in the first pod run but didn't finish correctly, e.g.:

$ cat first_pod_log.txt 
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
creating template1 database in /var/lib/pgsql/data/userdata/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok

And the readiness / liveness probe "decided" the pod in question needs to be restarted, the subsequent pod will end up in CrashLoopBackOff state with a message like the following in the pod log:

$ cat pod_first_restart_log.txt 
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
........................................................ done
server started
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
psql: FATAL: database "postgres" does not exist

And this scenario (try to restart the PostgreSQL pod, which subsequently fails with psql: FATAL: database "postgres" does not exist) is then retried a couple of times, until the default deployment config timeout (600 seconds, IIRC) is reached.

JFTR, the aforementioned behaviour was observed with the rhscl/postgresql-95-rhel7:latest image (but judging by the code, other image versions are likely prone to the very same issue).


iankko commented Feb 4, 2019

JFTR, here's the log of the pod, when the DB initialization finished successfully:

$ cat complete_pod_log.txt 
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
creating template1 database in /var/lib/pgsql/data/userdata/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    pg_ctl -D /var/lib/pgsql/data/userdata -l logfile start

waiting for server to start....LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
 done
server started
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ALTER ROLE
waiting for server to shut down.... done
server stopped
Starting server...
LOG:  redirecting log output to logging collector process
HINT: Future log output will appear in directory "pg_log".

Compared to the previous log above, it's visible that the pod got restarted before it had a chance to sync the necessary DB init configuration to the disk / PVC.


iankko commented Feb 4, 2019

Looking at the code, the problem seems to be this test, since the initialize_database routine performs more steps than just creating the "$PGDATA/postgresql.conf" configuration file.

E.g., what if the pod was restarted somewhere around this step?

Since in the next run of the pod the "$PGDATA/postgresql.conf" file already exists, but the PG_INITIALIZED variable wasn't properly (re)set, the necessary users, which are expected to be created later, aren't actually created. And we enter that try-to-restart / crash-loop-back-off cycle.

The quick (but untested) approach might be to make the DB initialization procedure atomic (either all of its parts are performed, or the next run retries from scratch). Will submit an untested patch in a bit.


iankko commented Feb 4, 2019

@pkubatrh @praiskup @hhorak @bparees PTAL -^ (once you get a chance)

Thank you, Jan


praiskup commented Feb 5, 2019

The quick (but untested) approach might be to make the DB initialization procedure atomic one

This sounds like we'd have to try initdb, and if that failed -- we'd have to remove the leftovers and try again. We need to be very careful here: doing rm -rf "$PGDATA" is a really dangerous thing, even if we made the situation really obvious.

Doing initdb && touch "$PGDATA/initdb_succeeded" would lead to a vanished $PGDATA in the upgrade scenario. Doing touch $PGDATA/initdb_in_progress && initdb && rm $PGDATA/initdb_in_progress would work if initdb allowed us to initialize a non-empty database (but it doesn't).

We can probably touch "$PGDATA/../initdb_in_progress", but it's not 100% clear that the container user can write anywhere else than "$PGDATA" nowadays (== $PGDATA/../userdata). And one would have to be brave enough to do test -f "$PGDATA/../initdb_in_progress" && rm "$PGDATA/../initdb_in_progress" && rm -r "$PGDATA"
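As a rough illustration of that marker idea, a hypothetical sketch only -- `initdb` is replaced by a stub here so the snippet is self-contained, and writability of `$PGDATA/..` is exactly the open question above:

```shell
#!/bin/sh
# Hypothetical sketch of the initdb_in_progress marker approach.
# Assumes the parent of $PGDATA is writable by the container user.
PARENT="$(mktemp -d)"               # stands in for the mounted volume
PGDATA="$PARENT/userdata"
MARKER="$PARENT/initdb_in_progress"

initdb_stub() {
    # Stand-in for `initdb -D "$1"`; the real image runs the actual binary.
    mkdir -p "$1" && touch "$1/postgresql.conf"
}

# A leftover marker means a previous init never finished: wipe and redo.
if [ -f "$MARKER" ]; then
    rm -rf "$PGDATA"
    rm -f "$MARKER"
fi

if [ ! -f "$PGDATA/postgresql.conf" ]; then
    touch "$MARKER"
    initdb_stub "$PGDATA"
    rm -f "$MARKER"
fi
echo "initialization finished"
```

The dangerous part is unchanged: if the marker check is ever wrong, the `rm -rf "$PGDATA"` removes real data, which is why the caution above applies.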


iankko commented Feb 5, 2019

@praiskup Thanks for looking (sorry, mtgs in the morning). Are we able to tell what a proper DB initialization should look like?

(re: This sounds like we'd have to try to initdb, and if that failed -- we'd have to remove the leftovers, and try again.)

From what I have tried on Fedora, if the DB is already initialized, calling another DB init won't do anything. But that doesn't solve the case where the DB wasn't completely initialized in the previous run.


iankko commented Feb 5, 2019

@praiskup I was originally thinking about enhancing the original if [ ! -f "$PGDATA/postgresql.conf" ]; then check with something like if [ ! -f "$PGDATA/postgresql.conf" ] || ! grep -q '1' "$PGDATA/db_init_complete"; then, and if that triggers, removing (via rm -rf?) the $PGDATA directory and retrying the initialization.


praiskup commented Feb 5, 2019

! grep -q '1' "$PGDATA/db_init_complete" would remove a datadir (mounted volume) initialized e.g. by a previous version of the container image


iankko commented Feb 5, 2019

@praiskup

Ok, fair enough.

So what about this? Instead of trying to identify the leftover content to be removed when initdb failed in a previous run, we would change the $PGDATA directory to point to some temporary directory, run initdb there with the -D option, and once that finished, copy all the content of that directory to the true $PGDATA directory (preserving permissions if needed) and set $PG_INITIALIZED to true. IOW, the if [ ! -f "$PGDATA/postgresql.conf" ] check would be left as is; we would just shuffle the $PGDATA directory location a bit.

If it failed in the first pod run, next time it would be generated again into another temporary directory (no harm). If it succeeded, it succeeded as a whole. Could this work?
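A minimal sketch of this temp-directory idea (hypothetical -- `initdb` is stubbed so the snippet is self-contained; the real script would run `initdb -D "$TMPINIT"`). Note the final `mv` is only atomic when the temporary directory sits on the same filesystem / PVC as `$PGDATA`:

```shell
#!/bin/sh
# Hypothetical sketch: initdb into a temp dir, publish to $PGDATA on success.
VOLUME="$(mktemp -d)"                 # stands in for the mounted PVC
PGDATA="$VOLUME/userdata"
TMPINIT="$VOLUME/userdata.initdb.$$"  # same filesystem => rename is atomic

initdb_stub() {
    # Stand-in for `initdb -D "$1"`.
    mkdir -p "$1" && touch "$1/postgresql.conf"
}

if [ ! -f "$PGDATA/postgresql.conf" ]; then
    rm -rf "$TMPINIT"                 # a leftover from an interrupted run is harmless
    if initdb_stub "$TMPINIT"; then
        mv "$TMPINIT" "$PGDATA"       # a crash before this line leaves $PGDATA untouched
    fi
fi
test -f "$PGDATA/postgresql.conf" && echo "database initialized"
```

Stale `userdata.initdb.*` directories from interrupted runs would accumulate on the volume and would need garbage collection on start; whether ownership / permissions survive the move on a given PVC type is also an assumption worth verifying.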


iankko commented Mar 11, 2019

Reproducer (one way to simulate a slow PVC; there may be more):

  1. Use NFS based PVCs
  2. Set rmem_default & rmem_max to "very" low values:
# find /proc -name rmem_default
/proc/sys/net/core/rmem_default
# echo '32768' > /proc/sys/net/core/rmem_default
# find /proc -name rmem_max
# echo '32768' > /proc/sys/net/core/rmem_max

and restart NFS server:

# service nfs restart
Redirecting to /bin/systemctl restart nfs.service

I have used a simple scenario where the OpenShift node that the PostgreSQL pod is to be scheduled on is / was also the NFS server. If you are using a different setup, you will need to modify the scenario above appropriately.

  3. Start the PostgreSQL pod in the usual way. The above change will make the initdb process take a longer time. While the new PostgreSQL database cluster is being created for the new pod, delete the pod (via the oc CLI or via the OpenShift web console).

Current result:
The pod in the 2nd run fails with the psql: FATAL message above (exactly the same, or its equivalent, like some required user not being created).

Expected result:
The PostgreSQL pod is able to detect that the initdb process failed (or got interrupted by a pod restart) and is able to recover from that situation.

praiskup commented:

If the POD is artificially slow, one should probably ensure that the initial liveness probe delay is long enough, I'm afraid.


iankko commented Mar 12, 2019

@praiskup JFTR, the reproducer above was just to emulate a scenario that happens in real clusters hosted in the cloud (network latencies etc.), not to artificially invent an unrealistic one. It is just a helper to see what happens when the PVC is slow.

The issue is that in these kinds of environments, the pod deployer often doesn't have control over / the ability to modify the PVC configuration (they can change the deployment config of PostgreSQL, yes, but that's impractical when you want one universal template that works in every environment).

Besides that, images intended for OpenShift should be designed to be stateless (since OpenShift can bounce the pod at any time for some cluster-internal / specific reason). Using persistent storage while only getting one chance to launch the pod (because on the 2nd run initdb will fail due to the $PGDATA directory not being empty) will not work: each time OpenShift (etcd) decides it needs to bounce the DB pod, the deployment will never succeed (the pod in question will keep getting restarted until the default OpenShift timeout, 600s IIRC, is reached).

praiskup commented:

Hmpfs, let's skip the "stateless VS databases" topic.

OpenShift seems to have only static limits for liveness/readiness/start-up probes (there's no way to ask OpenShift how fast it is in order to set up the limits), so we have to pick some value that is sane for general use. And those who are running a very slow OpenShift instance or have a non-standard use-case need to pick different timeout values.

Yes, initdb shouldn't end up in an inconsistent state (that's what your PRs are about). But that is just corner-case handling; you are dealing with a stack unusable "for PG", with very low limits, and just making the initdb part atomic won't help you (even if it were atomic, OpenShift would keep retrying and failing until it fails entirely anyway...).
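For what it's worth, on the deployer's side these limits can be raised per deployment config without touching the image, along these lines (the resource name and the values are illustrative, not recommendations):

```shell
# Give a slow PVC more headroom before the first liveness check fires;
# "dc/postgresql" is an example resource name.
oc set probe dc/postgresql --liveness \
    --initial-delay-seconds=90 --timeout-seconds=10
oc set probe dc/postgresql --readiness \
    --initial-delay-seconds=30 --timeout-seconds=10
```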

praiskup added a commit to praiskup/postgresql-container that referenced this issue Mar 20, 2019
Since we have no real deadlocks to be detected in livenessProbe
for now, lets have it (effectively) disabled for now.

The liveness probe used to cause issues before, namely because it:
- killed initdb on a rather slow storage
- would pg_upgrade for rather large data directory
- killed pod when PostgreSQL was in crash recovery

Fixes: sclorg#313, sclorg#309
Relates: sclorg#316
pkubatrh pushed a commit that referenced this issue Apr 2, 2019

pkubatrh commented Oct 4, 2019

This should now be fixed by #320

pkubatrh closed this as completed Oct 4, 2019

iankko commented Oct 4, 2019

@pkubatrh @praiskup Wonderful, TYVM!

iankko pushed a commit to iankko/redhat-sso-7-openshift-image that referenced this issue Mar 5, 2020
…nitialDelaySeconds"

to prevent readiness / liveness probe to end up DB pod lifecycle prematurely
(yet before the DB server has had chance properly to initialize)

Customize:
* "timeoutSeconds" to 10 and
* "initialDelaySeconds" to 90

and also specify:
* "successThreshold:" to 1 and
* "failureThreshold" to 3

on readiness and liveness probes in persistent templates to avoid readiness / liveness probe
to bail out DB pod lifecycle prematurely due to:
* "Inappropriate ioctl for device" event (case of MySQL probes), or
* Issues like sclorg/postgresql-container#309 (case of PostgreSQL probes)

Signed-off-by: Jan Lieskovsky <jlieskov@redhat.com>
iankko pushed a commit to iankko/redhat-sso-7-openshift-image that referenced this issue Mar 6, 2020
mhajas pushed a commit to mhajas/redhat-sso-7-openshift-image that referenced this issue Mar 10, 2020
mhajas pushed a commit to mhajas/redhat-sso-7-openshift-image that referenced this issue Mar 11, 2020
mhajas pushed a commit to mhajas/redhat-sso-7-openshift-image that referenced this issue Mar 11, 2020
hmlnarik pushed a commit to jboss-container-images/redhat-sso-7-openshift-image that referenced this issue Mar 12, 2020