I am writing an application in C# that will copy data from one Postgres table to another on a regular basis. I am using the Npgsql library.
I have run into the following issue: When there are thousands of rows to be copied (> 10k), the program runs very slowly.
I have tried:
For my first attempt, I pulled the entire destination table and compared the incoming data to what already existed. I then issued an UPDATE statement for each row that existed but had changed, and an INSERT for each row that did not exist at all. This was the worst solution, as every individual statement had to be sent as its own command.
Next, I tried handling conflicts with ON CONFLICT. This let me send all of the inserts as bulk INSERT INTO ... statements and have existing rows updated in place. This was significantly faster, but still not fast enough.
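Concretely, that second attempt builds one multi-row statement per batch in C#, roughly like this (destination_table, id, field1, and field2 are placeholders for my real schema):

using Npgsql;
using System.Collections.Generic;

// Placeholder schema: destination_table(id int primary key, field1 text, field2 text).
// One multi-row INSERT ... ON CONFLICT per chunk of rows instead of one command per row.
static void UpsertChunk(NpgsqlConnection conn, IReadOnlyList<(int Id, string Field1, string Field2)> rows)
{
    var values = new List<string>();
    using var cmd = new NpgsqlCommand { Connection = conn };
    for (int i = 0; i < rows.Count; i++)
    {
        values.Add($"(@id{i}, @f1{i}, @f2{i})");
        cmd.Parameters.AddWithValue($"id{i}", rows[i].Id);
        cmd.Parameters.AddWithValue($"f1{i}", rows[i].Field1);
        cmd.Parameters.AddWithValue($"f2{i}", rows[i].Field2);
    }
    cmd.CommandText =
        "insert into destination_table (id, field1, field2) values " + string.Join(", ", values) +
        " on conflict (id) do update set field1 = excluded.field1, field2 = excluded.field2";
    cmd.ExecuteNonQuery();
}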
I read about Postgres's COPY command, but it does not seem to suit my needs: COPY does ONLY inserts, NOT upserts. Because I am modifying this table several times, some of the data will be new, but some will be existing rows that need updating.
Has anyone come up with a fast way to UPSERT, given that I need the option to EDIT existing rows, not just do a blanket mass INSERT of all of my data?
Please let me know if I can provide any other information.
Thank you so much for your time.
CodePudding user response:
First of all, I assume the tables are on different databases, otherwise I would just do this all in DML.
I think copy is definitely your friend. There is no faster way to extract or load data, and then you can let the database do the heavy lifting.
On the source database:
copy source_table
to '/var/tmp/foo.csv' csv;
On the destination database:
truncate temp_table;
copy temp_table
from '/var/tmp/foo.csv' csv;
insert into destination_table
select *
from temp_table t
where not exists (
select null
from destination_table d
where t.id = d.id
);
update destination_table d
set
field1 = t.field1,
field2 = t.field2
from temp_table t
where
d.id = t.id and
(d.field1 is distinct from t.field1 or
d.field2 is distinct from t.field2);
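One note from the C#/Npgsql side: you don't actually need the CSV file on the database server at all, because Npgsql exposes COPY directly, so the application can stream rows from the source connection straight into the staging table. A rough sketch, assuming temp_table(id int, field1 text, field2 text) mirrors a source_table with the same columns (adjust names and types to the real schema, and use writer.WriteNull() for nullable columns):

using Npgsql;
using NpgsqlTypes;

// Stream rows from the source database into the staging table on the destination
// database via COPY, with no intermediate file on disk.
static void LoadStagingTable(NpgsqlConnection source, NpgsqlConnection dest)
{
    using var cmd = new NpgsqlCommand("select id, field1, field2 from source_table", source);
    using var reader = cmd.ExecuteReader();

    using var writer = dest.BeginBinaryImport(
        "copy temp_table (id, field1, field2) from stdin (format binary)");

    while (reader.Read())
    {
        writer.StartRow();
        writer.Write(reader.GetInt32(0), NpgsqlDbType.Integer);
        writer.Write(reader.GetString(1), NpgsqlDbType.Text);
        writer.Write(reader.GetString(2), NpgsqlDbType.Text);
    }

    writer.Complete();   // without Complete() the COPY is rolled back when the importer is disposed
}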
It would be great if you can do something like this whenever the data is readily available. A couple of other comments:
- the insert into uses an anti-join, which is my favorite construct for inserting missing records
- on the update, it's important to specify the criteria for what you update -- don't update everything, only those records that have changed. This makes a big difference in performance. Hopefully there is a fixed set of fields you can use to determine whether a record has changed.
If there is a field that indicates the record has been updated (last_update_date or something similar), a slightly lazier and wonderful approach is to delete those records and let the anti-join insert re-insert them. This removes the need for the update statement entirely and is much less code for tables with lots of columns.
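A rough sketch of that delete, reusing the staging table from above and assuming both tables carry id and last_update_date columns (names are just examples):

using Npgsql;

// Delete the destination rows that the staging table has newer versions of,
// then let the anti-join insert above put the fresh copies back.
static void DeleteChangedRows(NpgsqlConnection dest)
{
    const string sql = @"
        delete from destination_table d
        using temp_table t
        where d.id = t.id
          and t.last_update_date > d.last_update_date";
    using var cmd = new NpgsqlCommand(sql, dest);
    cmd.ExecuteNonQuery();
}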
CodePudding user response:
Pseudocode:
data = select column_time from table2;
foreach (row_time in data)
{
    // check the destination table, not the source, for an existing row
    count = select count(*) from table1 where column_time = row_time;
    if (count == 0)
    {
        insert row into table1;
    }
    else
    {
        // row already exists in both tables
    }
}
Do you think it may be faster?
In my case I just add ALL rows from table2 into table1 and truncate table2 at the end. Then, if I want to read some data, I can either:
a) show all (two) versions of a row: the old one and the newly copied one, or
b) show only the single newest version.
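Option b) could be a query along these lines, assuming table1 has an id column plus the column_time of each copy (both names are just examples):

using System;
using Npgsql;

// Read back only the newest copy of each row, using DISTINCT ON over the copy timestamp.
static void PrintNewestRows(NpgsqlConnection conn)
{
    const string sql = @"
        select distinct on (id) *
        from table1
        order by id, column_time desc";
    using var cmd = new NpgsqlCommand(sql, conn);
    using var reader = cmd.ExecuteReader();
    while (reader.Read())
        Console.WriteLine(reader[0]);   // handle each newest row as needed
}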