
PostgreSQL pod fails to restart properly if DB initialization didn't finish properly in previous pod run #309

Closed
iankko opened this issue Feb 4, 2019 · 15 comments


iankko commented Feb 4, 2019

When the OpenShift cluster is under higher load (hundreds of tests running sequentially, 3-5 tests running in parallel), the PostgreSQL pod can take a longer time to start. If the database initialization started in the first pod run but didn't finish correctly, e.g.:

$ cat first_pod_log.txt 
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
creating template1 database in /var/lib/pgsql/data/userdata/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok

And the readiness / liveness probe "decided" the pod in question needs to be restarted, the subsequent pod will end up in CrashLoopBackOff state with a message like the following in the pod log:

$ cat pod_first_restart_log.txt 
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
........................................................ done
server started
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
psql: FATAL: database "postgres" does not exist

And this scenario (try to restart the PostgreSQL pod, which subsequently fails with psql: FATAL: database "postgres" does not exist) is then retried a couple of times, until the default deployment config timeout (600 seconds, IIRC) is reached.

JFTR, the aforementioned behaviour was observed with the rhscl/postgresql-95-rhel7:latest image (but judging by the code, other image versions are likely prone to the very same issue).


iankko commented Feb 4, 2019

JFTR, here's the log of the pod, when the DB initialization finished successfully:

$ cat complete_pod_log.txt 
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
creating template1 database in /var/lib/pgsql/data/userdata/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    pg_ctl -D /var/lib/pgsql/data/userdata -l logfile start

waiting for server to start....LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
 done
server started
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ALTER ROLE
waiting for server to shut down.... done
server stopped
Starting server...
LOG:  redirecting log output to logging collector process
HINT: Future log output will appear in directory "pg_log".

Compared to the previous log above, it's visible that the pod got restarted before it had a chance to sync the necessary DB init configuration to the disk / PVC.


iankko commented Feb 4, 2019

Looking at the code, the problem seems to be this test, since the initialize_database routine performs more steps than just creating the "$PGDATA/postgresql.conf" configuration file.

E.g., what if the pod was restarted somewhere around this step?

Since in the next run of the pod the "$PGDATA/postgresql.conf" file already exists, but the PG_INITIALIZED variable wasn't properly (re)set, the necessary users, which are expected to be created later, aren't actually created. And we enter that try-to-restart / crash-loop-back-off cycle.

The quick (but untested) approach might be to make the DB initialization procedure atomic (either all of its parts are performed, or the next run retries from scratch). Will submit an untested patch in a bit.


iankko commented Feb 4, 2019

@pkubatrh @praiskup @hhorak @bparees PTAL -^ (once you get a chance)

Thank you, Jan


praiskup commented Feb 5, 2019

The quick (but untested) approach might be to make the DB initialization procedure atomic one

This sounds like we'd have to try initdb, and if that failed -- we'd have to remove the leftovers and try again. We need to be very careful here: doing rm -rf "$PGDATA" is a really dangerous thing, even if we made the situation really obvious.

Doing initdb && touch "$PGDATA/initdb_succeeded" would lead to a vanished $PGDATA in the upgrade scenario. Doing touch $PGDATA/initdb_in_progress && initdb && rm $PGDATA/initdb_in_progress would work if initdb allowed us to initialize a non-empty database (but it doesn't).

We can probably touch "$PGDATA/../initdb_in_progress", but it's not 100% clear that the container user can write anywhere else than "$PGDATA" nowadays (== $PGDATA/../userdata). And one would have to be brave enough to do test -f "$PGDATA/../initdb_in_progress" && rm "$PGDATA/../initdb_in_progress" && rm -r "$PGDATA"
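As a rough illustration of that marker idea, a hypothetical sketch only -- `initdb` is replaced by a stub here so the snippet is self-contained, and writability of `$PGDATA/..` is exactly the open question above:

```shell
#!/bin/sh
# Hypothetical sketch of the initdb_in_progress marker approach.
# Assumes the parent of $PGDATA is writable by the container user.
PARENT="$(mktemp -d)"               # stands in for the mounted volume
PGDATA="$PARENT/userdata"
MARKER="$PARENT/initdb_in_progress"

initdb_stub() {
    # Stand-in for `initdb -D "$1"`; the real image runs the actual binary.
    mkdir -p "$1" && touch "$1/postgresql.conf"
}

# A leftover marker means a previous init never finished: wipe and redo.
if [ -f "$MARKER" ]; then
    rm -rf "$PGDATA"
    rm -f "$MARKER"
fi

if [ ! -f "$PGDATA/postgresql.conf" ]; then
    touch "$MARKER"
    initdb_stub "$PGDATA"
    rm -f "$MARKER"
fi
echo "initialization finished"
```

The dangerous part is unchanged: if the marker check is ever wrong, the `rm -rf "$PGDATA"` removes real data, which is why the caution above applies.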


iankko commented Feb 5, 2019

@praiskup Thanks for looking (sorry, mtgs in the morning). Are we able to tell what a proper DB initialization should look like?

(re: This sounds like we'd have to try to initdb, and if that failed -- we'd have to remove the leftovers, and try again.)

From what I have tried on Fedora, if the DB is already initialized, calling another DB init won't do anything. But that doesn't solve the case where the DB wasn't completely initialized in the previous run.


iankko commented Feb 5, 2019

@praiskup I was originally thinking about enhancing the original if [ ! -f "$PGDATA/postgresql.conf" ]; then check with something like if [ ! -f "$PGDATA/postgresql.conf" ] || ! grep -q '1' "$PGDATA/db_init_complete"; then, and if that triggers, removing (via rm -rf?) the $PGDATA directory and retrying the initialization.


praiskup commented Feb 5, 2019

! grep -q '1' "$PGDATA/db_init_complete" would remove a datadir (mounted volume) initialized e.g. by a previous version of the container image


iankko commented Feb 5, 2019

@praiskup

Ok, fair enough.

So what about this? Instead of trying to identify the leftover content to be removed when initdb failed in a previous run, we would change the $PGDATA directory to point to some temporary directory, run initdb there with the -D option, and once that finished, copy all the content of that directory to the true $PGDATA directory (preserving permissions if needed) and set $PG_INITIALIZED to true. IOW, the if [ ! -f "$PGDATA/postgresql.conf" ] check would be left as is; we would just shuffle the $PGDATA directory location a bit.

If it failed in the first pod run, next time it would be generated again into another temporary directory (no harm). If it succeeded, it succeeded as a whole. Could this work?
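A minimal sketch of this temp-directory idea (hypothetical -- `initdb` is stubbed so the snippet is self-contained; the real script would run `initdb -D "$TMPINIT"`). Note the final `mv` is only atomic when the temporary directory sits on the same filesystem / PVC as `$PGDATA`:

```shell
#!/bin/sh
# Hypothetical sketch: initdb into a temp dir, publish to $PGDATA on success.
VOLUME="$(mktemp -d)"                 # stands in for the mounted PVC
PGDATA="$VOLUME/userdata"
TMPINIT="$VOLUME/userdata.initdb.$$"  # same filesystem => rename is atomic

initdb_stub() {
    # Stand-in for `initdb -D "$1"`.
    mkdir -p "$1" && touch "$1/postgresql.conf"
}

if [ ! -f "$PGDATA/postgresql.conf" ]; then
    rm -rf "$TMPINIT"                 # a leftover from an interrupted run is harmless
    if initdb_stub "$TMPINIT"; then
        mv "$TMPINIT" "$PGDATA"       # a crash before this line leaves $PGDATA untouched
    fi
fi
test -f "$PGDATA/postgresql.conf" && echo "database initialized"
```

Stale `userdata.initdb.*` directories from interrupted runs would accumulate on the volume and would need garbage collection on start; whether ownership / permissions survive the move on a given PVC type is also an assumption worth verifying.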


iankko commented Mar 11, 2019

Reproducer (one way to simulate a slow PVC; there may be more):

  1. Use NFS based PVCs
  2. Set rmem_default & rmem_max to "very" low values:
# find /proc -name rmem_default
/proc/sys/net/core/rmem_default
# echo '32768' > /proc/sys/net/core/rmem_default
# find /proc -name rmem_max
# echo '32768' > /proc/sys/net/core/rmem_max

and restart NFS server:

# service nfs restart
Redirecting to /bin/systemctl restart nfs.service

I have used a simple scenario where the OpenShift node that the PostgreSQL pod is to be scheduled on is / was also the NFS server. If you are using a different setup, you will need to modify the scenario above appropriately.

  3. Start the PostgreSQL pod in the usual way. The above change will make the initdb process take a longer time. While the new PostgreSQL database cluster is being created for the new pod, delete the pod (via the oc CLI or via the OpenShift web console).

Current result:
The pod in the 2nd run fails with the psql: FATAL message above (exactly the same, or its equivalent, like some required user not being created).

Expected result:
The PostgreSQL pod is able to detect that the initdb process failed (or got interrupted by a pod restart) and is able to recover from that situation.

praiskup commented:

If the POD is artificially slow, one should probably ensure that the initial liveness probe delay is long enough, I'm afraid.


iankko commented Mar 12, 2019

@praiskup JFTR, the reproducer above was just to emulate a scenario that happens in real clusters hosted in the cloud (network latencies etc.), not to artificially invent an unrealistic one. It is just a helper to see what happens when the PVC is slow.

The issue is that in these kinds of environments, the pod deployer often doesn't have control over / the ability to modify the PVC configuration (they can change the deployment config of PostgreSQL, yes, but that's impractical when you want one universal template that works in every environment).

Besides that, images intended for OpenShift should be designed to be stateless (since OpenShift can bounce the pod at any time for some cluster-internal / specific reason). Using persistent storage while only getting one chance to launch the pod (because on the 2nd run initdb will fail due to the $PGDATA directory not being empty) will not work: each time OpenShift (etcd) decides it needs to bounce the DB pod, the deployment will never succeed (the pod in question will keep getting restarted until the default OpenShift timeout, 600s IIRC, is reached).

praiskup commented:

Hmpfs, let's skip the "stateless VS databases" topic.

OpenShift seems to have only static limits for liveness/readiness/start-up probes (there's no way to ask OpenShift how fast it is in order to set up the limits), so we have to pick some value that is sane for general use. And those who are running a very slow OpenShift instance or have a non-standard use-case need to pick different timeout values.

Yes, initdb shouldn't end up in an inconsistent state (that's what your PRs are about). But that is just corner-case handling; you are dealing with a stack unusable "for PG", with very low limits, and just making the initdb part atomic won't help you (even if it were atomic, OpenShift would keep retrying and failing until it fails entirely anyway...).
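For what it's worth, on the deployer's side these limits can be raised per deployment config without touching the image, along these lines (the resource name and the values are illustrative, not recommendations):

```shell
# Give a slow PVC more headroom before the first liveness check fires;
# "dc/postgresql" is an example resource name.
oc set probe dc/postgresql --liveness \
    --initial-delay-seconds=90 --timeout-seconds=10
oc set probe dc/postgresql --readiness \
    --initial-delay-seconds=30 --timeout-seconds=10
```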

praiskup added a commit to praiskup/postgresql-container that referenced this issue Mar 20, 2019
Since we have no real deadlocks to be detected in livenessProbe
for now, lets have it (effectively) disabled for now.

The liveness probe used to cause issues before, namely because it:
- killed initdb on a rather slow storage
- would pg_upgrade for rather large data directory
- killed pod when PostgreSQL was in crash recovery

Fixes: sclorg#313, sclorg#309
Relates: sclorg#316
pkubatrh pushed a commit that referenced this issue Apr 2, 2019

pkubatrh commented Oct 4, 2019

This should now be fixed by #320

pkubatrh closed this as completed Oct 4, 2019

iankko commented Oct 4, 2019

@pkubatrh @praiskup Wonderful, TYVM!

iankko pushed a commit to iankko/redhat-sso-7-openshift-image that referenced this issue Mar 5, 2020
…nitialDelaySeconds"

to prevent readiness / liveness probe to end up DB pod lifecycle prematurely
(yet before the DB server has had chance properly to initialize)

Customize:
* "timeoutSeconds" to 10 and
* "initialDelaySeconds" to 90

and also specify:
* "successThreshold:" to 1 and
* "failureThreshold" to 3

on readiness and liveness probes in persistent templates to avoid readiness / liveness probe
to bail out DB pod lifecycle prematurely due to:
* "Inappropriate ioctl for device" event (case of MySQL probes), or
* Issues like sclorg/postgresql-container#309 (case of PostgreSQL probes)

Signed-off-by: Jan Lieskovsky <jlieskov@redhat.com>
iankko pushed a commit to iankko/redhat-sso-7-openshift-image that referenced this issue Mar 6, 2020
mhajas pushed a commit to mhajas/redhat-sso-7-openshift-image that referenced this issue Mar 10, 2020
mhajas pushed a commit to mhajas/redhat-sso-7-openshift-image that referenced this issue Mar 11, 2020
mhajas pushed a commit to mhajas/redhat-sso-7-openshift-image that referenced this issue Mar 11, 2020
hmlnarik pushed a commit to jboss-container-images/redhat-sso-7-openshift-image that referenced this issue Mar 12, 2020