Home > Mobile >  How to efficiently merge data into a target table?
How to efficiently merge data into a target table?

Time:09-30

Idea: need to run a daily cron job that upsert delete a bunch of data to/from a target table. The data is in a txt file; it contains the newest version of the data that the target table should eventually have. Therefore size of data in the txt file is similar to the size of the table, because the txt file is a mirror of the table but with some changes.

Let's say the target table has this following data: enter image description here

And the txt file has the following data: enter image description here

In this case we should delete row 0 from target table because it doesn't exist in txt file (newest data) anymore, and update row 1 because his job has changed, and insert row 3 because it's new data that doesn't exist in table but exist in txt file.

Essentially I wanna merge (upsert delete) data from txt file into target table.

Assuming the target table has 1 million rows of data, the txt file also has approximate number of rows. How to efficiently merge data from txt file to target table?

Currently I am reading data from txt file in a pandas dataframe, and use insert on conflict update to do upsert, and use a delete statement to remove rows from table that doesn't exist in dataframe. But it's doing a million row update, which isn't ideal. How to efficient update the table so that only the updated/new data get updated/inserted in the table? So if only 10 rows are updated, we should only update those 10 rows.

CodePudding user response:

Your description basically describes replacing everything in the target_table with the contents of the txt file. The quickest way may be to literally do that. Something like:

start transaction;
truncate target_table; 
copy target_table from 'txt file'; 
commit;  

You will need to translate the above to your specific script/language and probably use an absolute path to the file. See example here. Since I cannot issue copy in a fiddle I substitute text_table for txt file. ?

CodePudding user response:

One way to do this is to use the WHERE clause of DO UPDATE:

INSERT INTO tab ...
ON CONFLICT ... DO UPDATE SET ...
   WHERE (tab.col1 IS DISTINCT FROM EXCLUDED.col1 OR
          tab.col2 IS DISTINCT FROM EXCLUDED.col2 OR
          ...);

The other way is to use the predefined suppress_redundant_updates_trigger() trigger function.

The first solution will probably perform better, since it avoids calling a trigger function, but the second solution will be simpler.

  • Related