How can I optimize postgres insert/update request of huge amount of data?


I'm working on a pathfinding project that uses topographic data covering huge areas. To reduce the memory load, my plan is to pre-process the map data by creating nodes that are saved to a PostgreSQL database on start-up, and then accessed as needed by the algorithm.

I've created three Docker containers for that: the Postgres DB, Adminer, and my Python app. It works as expected with small amounts of data, so the communication between the containers and the application itself aren't the problem.

The way it works is that you give it a 2D array; it takes the first row, converts each element into a node, and saves it to the DB using psycopg2.extras.execute_values before moving on to the second row, then the third... Once all nodes are registered, it updates each of them by searching for their neighbors and adding their ids in the right columns. That way the pre-processing takes longer, but I have easier access when running the algorithm.
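For reference, a minimal sketch of that kind of batched insert with execute_values; the connection string and the nodes(x, y, elevation) table layout are assumptions for illustration, not my actual schema:

```python
import psycopg2
from psycopg2.extras import execute_values

# Hypothetical connection settings and table layout, just for illustration.
conn = psycopg2.connect("dbname=pathfinding user=postgres host=db")

def insert_row_of_nodes(conn, nodes, batch_size=1000):
    """Insert a list of (x, y, elevation) tuples in batched round trips."""
    with conn.cursor() as cur:
        # execute_values expands the single %s into a multi-row VALUES list,
        # sending up to batch_size rows per statement.
        execute_values(
            cur,
            "INSERT INTO nodes (x, y, elevation) VALUES %s",
            nodes,
            page_size=batch_size,
        )
    conn.commit()
```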

However, I think the DB has trouble processing the data past a certain point. The map I gave it comes from a 9600x14400 .tif file, and even after ignoring useless/invalid data, that amounts to more than 10 million nodes.

Basically, it worked quite slowly but okay until around 90% of the node creation process, at which point the data stopped being processed. Both the Python and Postgres containers were still running and responsive, but no more nodes were being created, and the neighbor-linking part of the pre-processing didn't start either. There were no error messages on either side.

I've read that the row limit of a Postgres table is absurdly high, but a table also becomes really slow once it holds a lot of elements, so could it be that it didn't crash or freeze, but just takes an insane amount of time to complete the remaining node creation requests?

Would reducing the batch size even further help in that regard? Or would splitting the table into multiple smaller ones be better?

CodePudding user response:

The queries and psycopg2 functions I was using were not optimized for the mass inserts and updates I was doing.

The changes I've made were:

  • Reducing the batch size from 14k to 1k
  • Making larger SELECT queries instead of many smaller ones
  • Creating indexes on the important columns
  • Changing a plain UPDATE query to an UPDATE ... FROM, executed with execute_values instead of cursor.execute (see the sketch after this list)
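Here is a rough sketch of what the last two points can look like: an index on the lookup columns and the neighbor-linking step written as an UPDATE ... FROM driven by execute_values. The table and column names (nodes, id, north_id, and so on) are assumptions, not the actual schema:

```python
from psycopg2.extras import execute_values

def create_indexes(conn):
    """Index the columns used for the neighbor lookups (names are assumed)."""
    with conn.cursor() as cur:
        cur.execute("CREATE INDEX IF NOT EXISTS nodes_xy_idx ON nodes (x, y)")
    conn.commit()

def link_neighbors(conn, neighbor_rows, batch_size=1000):
    """Bulk-update neighbor ids by joining the table against a VALUES list.

    neighbor_rows: iterable of (node_id, north_id, south_id, east_id, west_id)
    integer tuples.
    """
    sql = """
        UPDATE nodes AS n
        SET north_id = v.north_id,
            south_id = v.south_id,
            east_id  = v.east_id,
            west_id  = v.west_id
        FROM (VALUES %s) AS v (node_id, north_id, south_id, east_id, west_id)
        WHERE n.id = v.node_id
    """
    with conn.cursor() as cur:
        # One statement per batch of rows instead of one UPDATE per node.
        execute_values(cur, sql, neighbor_rows, page_size=batch_size)
    conn.commit()
```

The point of the UPDATE ... FROM form is that each statement updates a whole batch of rows in a single pass over the table, instead of issuing one round trip and one row lookup per node.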

It made the execution time go from around an estimated 5.5 days to around 8 hours.
