I am in the process of writing some scripts to clean up some data. Because we cannot replicate the production database we had to somehow manually load some test data in two tables with the same structure and same data distribution as the one in production.
We use PostgreSQL 12.8
The tables are huge and it took a lot of effort to populate those two tables. In order to accelerate the inserts we dropped all indexes in both tables. At the end of the load when wanting to recreate the indexes back we find out that by accident some data was loaded twice.
Say our tables look like this:
table_1: table_2:
id id value
---- --------------
1 id-1 1 id-1 v-1
2 id-1 2 id-1 v-2
3 id-2 3 id-1 v-3
4 id-2 4 id-1 v-4
5 id-3 5 id-2 v-5
6 id-2 v-6
7 id-2 v-7
8 id-2 v-8
9 id-3 v-9
10 id-3 v-10
As you can see in table_1
row numbers 2) and 4) are duplicates. We would like to replace them with something else; say id-4
and id-5
. It does not matter how we compute the new ID, it just have to be unique. It can be anything even id-1-dedup
and id-2-dedup
but is very important half of the duplicated ids in table_2
to get the same fix. After applying the fix I would like our two tables to look like below:
table_1: table_2:
id id value
---- --------------
1 id-1 1 id-1 v-1
2 id-4 2 id-1 v-2
3 id-2 3 id-4 v-3
4 id-5 4 id-4 v-4
5 id-3 5 id-2 v-5
6 id-2 v-6
7 id-5 v-7
8 id-5 v-8
9 id-3 v-9
10 id-3 v-10
Which row has gets the new ID is irrelevant we just want to eliminate duplicates from table_1
and reflect the change in table_2
. Pity merge statement is not available in PostgreSQL 12.8
Deleting the data and reinserting it is not really an option because there it was a lot of work to populate value
columns and we'd only do this if no other clever way.
Thank you in advance.
CodePudding user response:
The approach I would take would be to add a serial column to the two tables. You can then run the following updates:
WITH cte as
(SELECT serial_col, row_number() over(partition by id) as rn
FROM table_1)
UPDATE table_1
SET id = table_1.id || '-dup'
FROM cte
WHERE cte.serial_col = table_1.serial_col AND cte.rn = 2;
and
WITH cte as
(SELECT serial_col, row_number() over(partition by id) as rn
FROM table_2)
UPDATE table_2
SET id = table_2.id || '-dup'
FROM cte
WHERE cte.serial_col = table_2.serial_col AND cte.rn > 2;
Note this presupposes that you just have duplicates and no triplicates.
After running the updates you can drop the serial columns if necessary.