postgres 14 "create publication" stuck for hours-CodePudding

I upgraded from Postgres 10 to Postgres 14 using pg_upgrade --link option. The total databases size is around 10TB. The pg_upgrade was successful and quick and like the tool suggested -

Optimizer statistics are not transferred by pg_upgrade. Once you start the new server, consider running: /usr/pgsql-14/bin/vacuumdb --all --analyze-in-stages

I ran the above command but the process is stuck. As a side effect of this (or not, not sure) when I create a publication the prompt never comes back and the publication is not created even after many hours.

postgres=# select * from pg_stat_progress_vacuum;

c1	c2
pid	9520
datid	16402
datname	xyz
relid	22423
phase	vacuuming indexes
heap_blks_total	232816470
heap_blks_scanned	36766348
heap_blks_vacuumed	0
index_vacuum_count	0
max_dead_tuples	11184809
num_dead_tuples	11184521

This is the same output from yesterday. What can I do to speed up this and the "create publication" command? On a side note: The VM on which Postgres is running is quite powerful (64GB RAM, 16 cores). Thanks!

edit 1: the output of pg_stat_activity for the same pid,

c1	c2
pid	9520
backend_start	2021-12-06 15:13:23.479071-08
xact_start	2021-12-06 15:13:23.512581-08
query_start	2021-12-06 15:13:23.512581-08
state_change	2021-12-06 15:13:23.512581-08
wait_event_type	Timeout
wait_event	VacuumDelay
state	active
backend_xmin	3140627534
query	autovacuum: VACUUM xyz (to prevent wraparound)
backend_type	autovacuum worker

CodePudding user response：

vacuumdb --all --analyze-in-stages will not run VACUUM, but ANALYZE, so you have to look into pg_stat_progress_analyze to see how it is doing.

The VACUUM process you see running is unrelated to that. It is an anti-wraparound vacuum that is currently sleeping, but otherwise processing. Let it finish; this process is important for the health of your database. If you want further autovacuum runs on that table to complete faster, reduce autovacuum_vacuum_cost_delay for that table.

CodePudding user response：

Simply going the upgrade should not have caused the anti-wraparound vacuum to run, so that it decided to run right after the upgrade may be a coincidence of timing. On the other hand, maybe your new database version has different config settings, for example a lower value for autovacuum_freeze_max_age, running with this new setting is what triggered it to kick in right now. Did you transfer all of your non-default config settings from v10 to v14?

I agree with Laurenz that do need to let this finish, but that doesn't mean you need to let it finish right now. You could terminate the vacuuming backend so that your CREATE PUBLICATION has a chance to run. The autovac might promptly restart, so you should have the CREATE PUBLICATION already trying to run when you cancel the vacuum. That way it will be able to acquire the lock before the vacuum can start up again and grab the lock again. But do make sure you don't get in the habit of cancelling vacuums every time they inconvenience you.

Also, you should increase maintenance_work_mem by a lot. It looks like it is currently set at 64MB, which is quite low for the server you describe. If you set it to 1GB, then it should be able to vacuum the entire table with only one pass over the indexes rather than the seven you are currently on pace for. I would change this setting in the conf file and SIGHUP the server before you cancel the vacuum, that way the new vacuum that gets launched should have the new setting.

Finally, I don't know why this is vacuuming the index. I thought v14 changed it so that emergency vacuums didn't bother to do that, but just froze tuples in the heap and left indexes for later. I guess I need to study v14 a bit more to figure out what it really does here.