How to shuffle column values while keeping same dataset-CodePudding

I would like to to shuffle some columns from a table in Postgres database. I have 2 millions rows. I need to update all not null values by another.

I need to keep the same dataset. It's not possible to have the same value two times. It's not possible to swap data with next values because, if I do the same process with another column I will keep the same link. It's to anonymise my database. Just need to shuffle data and keep the dataset.

exemple (change firstname and lastname):

id	firstname	lastname
1	albert	einsten
2	isaac	newton
3		curie
4	alexandre	Graham Bell
5	thomas	Edison

shuffle firstname column:

id	firstname	lastname
1	isaac	Graham Bell
2	albert	Edison
3		einsten
4	thomas	newton
5	alexandre	curie

How to do this with a speedy process?

CodePudding user response：

Given how general your requirements are about the order of the shuffle, I'm not sure how much this will help practically, but I think it answers your question:

update test
SET firstname = t2.firstname
FROM
(
 SELECT id, COALESCE(LAG(firstname, 1) OVER (ORDER BY id RANGE UNBOUNDED PRECEDING), LAST_VALUE(firstname) OVER (ORDER BY id RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) firstname
 FROM test t2
 WHERE firstname IS NOT NULL
) t2
WHERE test.id = t2.id

The idea here is that the inner query gets the values shifted by one (ignoring nulls). The COALESCE is used since the first one doesn't have a proceeding entry, so it falls back on some LAST_VALUE logic to get the last value (i.e. it behaves as if the shift loops around).

The surrounding UPDATE statement joins test to the subquery to actually update the data.

You can see it working in this Fiddle.

CodePudding user response：

This shuffles values in the column firstname in a perfectly random fashion:

UPDATE test t0
SET    firstname = t2.firstname
FROM  (SELECT row_number() OVER (ORDER BY random()) AS rn, id        FROM test WHERE firstname IS NOT NULL) t1
JOIN  (SELECT row_number() OVER (ORDER BY random()) AS rn, firstname FROM test WHERE firstname IS NOT NULL) t2 USING (rn)
WHERE  t0.id = t1.id
AND    t0.firstname IS NOT NULL;

"Perfectly random" includes the possibility that some columns might retain their original values. (The more rows the smaller the chance.) This is actually best for anonymizing data. Then values are truly random. If we force a switch, readers will get the minimum information that a different value was associated with a given ID.

It also observes your surprising rule to only shuffle not null values.

Repeat for every column you need to shuffle.

Without excluding NULL values, this single query works more cheaply for any number of columns:

UPDATE test t0
SET    firstname = t2.firstname
     , lastname  = t3.lastname
FROM  (SELECT row_number() OVER (ORDER BY random()) AS rn, id        FROM test) t1
JOIN  (SELECT row_number() OVER (ORDER BY random()) AS rn, firstname FROM test) t2 USING (rn)
JOIN  (SELECT row_number() OVER (ORDER BY random()) AS rn, lastname  FROM test) t3 USING (rn)
WHERE  t0.id = t1.id;

If id is a gap-less sequence we can remove t1 from the equation, and join t0.id to t2.rn. (Wouldn't work while excluding NULL values.)

fiddle