We have a job that needs to insert millions of records through a REST API. We found that after inserting a few million records, the service hangs and stops responding.
I suspect the problem is with the Postgres DB, so I grabbed the log (from AWS RDS) and found these relevant lines:
2022-09-08 19:17:58 UTC::@:[26730]:LOG: automatic vacuum of table "xxx.xxx.edges": index scans: 1
pages: 0 removed, 2659041 remain, 0 skipped due to pins, 321741 skipped frozen
tuples: 4680884 removed, 48758705 remain, 657299 are dead but not yet removable, oldest xmin: 458583508
index scan needed: 268594 pages from table (10.10% of total) had 4843221 dead item identifiers removed
index "edges_pkey": pages: 398515 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "edges_start_node_id_idx": pages: 289505 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "edges_end_node_id_idx": pages: 263812 in total, 0 newly deleted, 0 currently deleted, 0 reusable
index "edges_bounding_box_gix": pages: 675160 in total, 250 newly deleted, 456 currently deleted, 206 reusable
index "edges_edge_type_idx": pages: 329144 in total, 7630 newly deleted, 7755 currently deleted, 7755 reusable
index "edges_source_road_type_idx": pages: 73470 in total, 49 newly deleted, 49 currently deleted, 43 reusable
I/O timings: read: 38956.761 ms, write: 1681.906 ms
avg read rate: 57.004 MB/s, avg write rate: 29.287 MB/s
buffer usage: 5317313 hits, 2531220 misses, 1300474 dirtied
WAL usage: 2268022 records, 1302892 full page images, 3546499081 bytes
system usage: CPU: user: 153.21 s, system: 10.87 s, elapsed: 346.90 s
I cannot find docs on how to interpret this log. For the line "system usage: CPU: user: 153.21 s, system: 10.87 s, elapsed: 346.90 s", does that mean the automatic vacuum took about 346 seconds in total?
While the table is being vacuumed and analyzed, are reads and writes blocked?
If the answer is yes, then I need to think about how to address the issue. I am thinking of tuning the autovacuum parameters so that it runs on smaller amounts of changed data with more resources allocated (following the guidance here: https://aws.amazon.com/blogs/database/understanding-autovacuum-in-amazon-rds-for-postgresql-environments/), so that each vacuum finishes faster and does not block for too long, and my job of inserting millions of rows can finish. Does that sound like the right approach?
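Concretely, I was thinking of per-table settings roughly like the following; the table name matches my log above and the values are only placeholders I have not tested yet:

-- make autovacuum kick in after a smaller fraction of dead rows and let it do more work per cycle
ALTER TABLE edges SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- trigger at roughly 1% dead tuples instead of the 20% default
    autovacuum_vacuum_cost_limit   = 2000,  -- allow more work before each cost-based pause
    autovacuum_vacuum_cost_delay   = 2      -- shorter pause between cost cycles, in ms
);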
CodePudding user response:
Vacuum does not directly block reading and writing in any substantial way. It will take light-weight locks on pages and other data structures while it manipulates them, but each such lock is held only very briefly. It takes a heavyweight lock on the table itself (SHARE UPDATE EXCLUSIVE) which it holds for the duration of the autovacuum, but that is a weak lock mode which won't directly block ordinary reads and writes. It will generate a lot of IO requests, which will compete with other processes that also want to do IO. This could slow them down, perhaps substantially, but shouldn't completely freeze them. If autovacuum is consuming most of your IOPS capacity, you could throttle it more heavily by changing autovacuum_vacuum_cost_delay and friends. This would make it run more slowly, but have a gentler effect on other processes.
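If you want to see what throttling is currently in effect before changing anything in the RDS parameter group, a plain pg_settings query is enough (nothing RDS-specific here):

-- current vacuum cost/throttling settings
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('autovacuum_vacuum_cost_delay',
               'autovacuum_vacuum_cost_limit',
               'vacuum_cost_limit',
               'vacuum_cost_page_hit',
               'vacuum_cost_page_miss',
               'vacuum_cost_page_dirty');

A value of -1 for autovacuum_vacuum_cost_limit means it falls back to vacuum_cost_limit.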
You should tackle the problem with your app directly. If your app is not blocked on locks, it doesn't matter who is holding locks. If it is not blocked on IO, it doesn't matter who else is consuming IO.
If the problem happens while you are watching, you can use pg_stat_activity to see what queries the app is trying to run and what is holding them up (wait_event). If it is something you need to investigate after the fact, you can use auto_explain to log query plans for slow queries. Be sure to turn on track_io_timing and configure auto_explain to record buffer usage as well. Also turn on log_lock_waits so you can see whether you are blocking on heavy-weight locks.
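For example, while the insert job is hung, something along these lines (pg_blocking_pids needs Postgres 9.6 or later) shows each active session, its wait event, and which backends hold the locks it is waiting on:

-- what is every non-idle session doing, and what is it waiting for?
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS running_for,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start;

If blocked_by is empty and the wait events are IO-related, the app is contending for IO rather than locks, which points back at throttling autovacuum or raising IOPS rather than changing lock behavior.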