I would like to to shuffle some columns from a table in Postgres database. I have 2 millions rows. I need to update all not null values by another.
I need to keep the same dataset. It's not possible to have the same value two times. It's not possible to swap data with next values because, if I do the same process with another column I will keep the same link. It's to anonymise my database. Just need to shuffle data and keep the dataset.
exemple (change firstname and lastname):
id | firstname | lastname |
---|---|---|
1 | albert | einsten |
2 | isaac | newton |
3 | curie | |
4 | alexandre | Graham Bell |
5 | thomas | Edison |
shuffle firstname column:
id | firstname | lastname |
---|---|---|
1 | isaac | Graham Bell |
2 | albert | Edison |
3 | einsten | |
4 | thomas | newton |
5 | alexandre | curie |
How to do this with a speedy process?
CodePudding user response:
Given how general your requirements are about the order of the shuffle, I'm not sure how much this will help practically, but I think it answers your question:
update test
SET firstname = t2.firstname
FROM
(
SELECT id, COALESCE(LAG(firstname, 1) OVER (ORDER BY id RANGE UNBOUNDED PRECEDING), LAST_VALUE(firstname) OVER (ORDER BY id RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) firstname
FROM test t2
WHERE firstname IS NOT NULL
) t2
WHERE test.id = t2.id
The idea here is that the inner query gets the values shifted by one (ignoring nulls). The COALESCE
is used since the first one doesn't have a proceeding entry, so it falls back on some LAST_VALUE
logic to get the last value (i.e. it behaves as if the shift loops around).
The surrounding UPDATE
statement joins test to the subquery to actually update the data.
You can see it working in this Fiddle.
CodePudding user response:
This shuffles values in the column firstname
in a perfectly random fashion:
UPDATE test t0
SET firstname = t2.firstname
FROM (SELECT row_number() OVER (ORDER BY random()) AS rn, id FROM test WHERE firstname IS NOT NULL) t1
JOIN (SELECT row_number() OVER (ORDER BY random()) AS rn, firstname FROM test WHERE firstname IS NOT NULL) t2 USING (rn)
WHERE t0.id = t1.id
AND t0.firstname IS NOT NULL;
"Perfectly random" includes the possibility that some columns might retain their original values. (The more rows the smaller the chance.) This is actually best for anonymizing data. Then values are truly random. If we force a switch, readers will get the minimum information that a different value was associated with a given ID.
It also observes your surprising rule to only shuffle not null values.
Repeat for every column you need to shuffle.
Without excluding NULL
values, this single query works more cheaply for any number of columns:
UPDATE test t0
SET firstname = t2.firstname
, lastname = t3.lastname
FROM (SELECT row_number() OVER (ORDER BY random()) AS rn, id FROM test) t1
JOIN (SELECT row_number() OVER (ORDER BY random()) AS rn, firstname FROM test) t2 USING (rn)
JOIN (SELECT row_number() OVER (ORDER BY random()) AS rn, lastname FROM test) t3 USING (rn)
WHERE t0.id = t1.id;
If id
is a gap-less sequence we can remove t1
from the equation, and join t0.id
to t2.rn
. (Wouldn't work while excluding NULL values.)