This question is a follow-up to a previous question I posted about efficient ways to fill up the Postgres `pg_wal` directory (for training purposes): How to bloat pg_wal?
I'm now wondering what the possible solutions are for getting a Postgres server up and running again once the partition is full because `pg_wal` has filled up.
I'm interested in solutions that do not involve adding extra disk space to the partition.
Here are the two other solutions I've come across while discussing with colleagues:
- Move the entire `pg_wal` directory to another partition with sufficient available disk space and point to the new location from `$PGDATA` (OK, this is kind of an additional-disk-space solution as well).
- Regain disk space by removing all WALs that have already been archived (a file with the same name, suffixed with `.done`, should be present in `pg_wal/archive_status`).
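The first option can be sketched safely on throwaway directories. Everything below uses placeholder paths created with `mktemp`, not a real cluster; on an actual server you would stop PostgreSQL first and keep ownership with the postgres user:

```shell
# Sketch of the move-and-symlink idea -- $PGDATA and $BIG are stand-ins,
# not a real cluster. Stop PostgreSQL before doing this for real.
PGDATA=$(mktemp -d)/data        # stand-in for the real data directory
BIG=$(mktemp -d)                # stand-in for the partition with free space
mkdir -p "$PGDATA/pg_wal"

# Move the WAL directory to the roomy partition...
mv "$PGDATA/pg_wal" "$BIG/pg_wal"
# ...and leave a symlink behind so PostgreSQL still finds it under $PGDATA.
ln -s "$BIG/pg_wal" "$PGDATA/pg_wal"

ls -ld "$PGDATA/pg_wal"         # shows the symlink pointing at the new home
```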
Using pgBackRest, I ran the `archive-push` command manually and then removed the WALs from the `pg_wal` directory, but I got the following error when starting Postgres:
```
2022-06-01 13:54:47 UTC [9334]: user=,db=,app=,client=LOG: invalid primary checkpoint record
2022-06-01 13:54:47 UTC [9334]: user=,db=,app=,client=PANIC: could not locate a valid checkpoint record
```
Obviously I removed too many files, but I'm wondering whether a clean solution based on the same idea could work.
Question
Is there a way to clean up `pg_wal` without having to restart Postgres once it has shut down because `pg_wal` has filled up?
CodePudding user response:
> I'm interested in solutions that do not involve adding extra disk space to the partition.
Hmm, then you are excluding the proper solution, which is exactly to increase the disk space on the WAL file system.
Your first solution (move `pg_wal` and put a symbolic link into the data directory) is entirely feasible. But as you say, that requires additional disk space, so why not extend the actual WAL file system instead?
Your second idea is not commendable. True, WAL segments that are marked `.done` can be removed, but PostgreSQL will do that automatically at the next checkpoint anyway, so there should not be many of those lying around at any given time. And as you noticed, manually messing with the data directory is not a good idea; the danger of breaking your database is just too high.
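For completeness, the `.done` check itself looks like this, sketched on fabricated segment names (real WAL file names are 24 hex characters; nothing here touches a live cluster):

```shell
# Fabricated example of the archive_status check -- do NOT script deletions
# against a real cluster on this basis alone.
WAL=$(mktemp -d)
mkdir -p "$WAL/archive_status"
touch "$WAL/000000010000000000000001" "$WAL/000000010000000000000002"
touch "$WAL/archive_status/000000010000000000000001.done"  # only #1 archived

# A segment is a removal candidate only if its .done marker exists:
for f in "$WAL"/0*; do
  seg=$(basename "$f")
  if [ -f "$WAL/archive_status/$seg.done" ]; then
    echo "archived, removable: $seg"
  fi
done
# prints only: archived, removable: 000000010000000000000001
```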
Keep your hands off `pg_resetwal`. This executable, when run on a crashed data directory, will remove WAL and put PostgreSQL into a state where you can start it, but it will cause data corruption. `pg_resetwal` is intended as a desperate measure to get a corrupted server to start so you can salvage some of the data.
CodePudding user response:
You probably don't need very much space to get it running again, so you should only need to remove a handful of files with corresponding `.done` entries, not all of them. Once the system is running, it should clean things up the rest of the way on its own. If you had removed just the oldest handful, it might not have caused a problem in the first place. But given that you have already removed too many, you should be able to manually copy the important ones back from the archive into `pg_wal`.
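Since the question mentions pgBackRest, copying a segment back would be done with its `archive-get` command. A sketch, with a hypothetical stanza name, segment, and path (the block only assembles and prints the command line, since it can't assume a live repository):

```shell
# Hypothetical sketch: pull one segment back from the pgBackRest repository
# into pg_wal. "demo", the segment name, and the path are placeholders.
STANZA=demo
SEG=000000010000000000000042
PGWAL=/var/lib/postgresql/14/main/pg_wal

# On the real server this would be run as the postgres user:
CMD="pgbackrest --stanza=$STANZA archive-get $SEG $PGWAL/$SEG"
echo "$CMD"
```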
You could instead delete (really: copy somewhere else first, then delete) any recycled-ahead WAL files, i.e. segments that were already archived and then renamed to a future name in anticipation of reuse. The problem is how to identify them with high confidence on a crashed instance.
What I do is keep a file with a couple of gigabytes of random gibberish (to defeat transparent compression, if any is in use) as "ballast" somewhere on the same partition as the data directory. Then if I run into such an emergency, I just delete the ballast file, then try to remember to recreate it later.
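The ballast trick can be sketched like this, sized down to 2 MB for demonstration (in practice you would use a couple of gigabytes and place it on the data directory's partition rather than in a temp directory):

```shell
# Ballast-file sketch: incompressible random bytes reserved on disk.
# 2 MB here for demonstration; use a couple of GB in practice.
BALLAST=$(mktemp -d)/ballast   # in practice: on the data directory's partition
dd if=/dev/urandom of="$BALLAST" bs=1M count=2 2>/dev/null

wc -c < "$BALLAST"             # 2097152 bytes reserved
# Emergency procedure: rm "$BALLAST", start PostgreSQL, recreate the ballast.
```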