How can I flatten a case sensitive column that is also used in a relation-CodePudding

Hello postgres experts,

I have a users table

id	email
1	[email protected]
2	[email protected]

And a posts table

id	userId	content
1	1	foo
2	2	bar

The emails of the two users are the same when ignoring case so I am looking for a way to drop the rows that have duplicated emails and update the userId in the posts table to point to the user that remains.

So the final result will be:

id	email
1	[email protected]

id	userId	content
1	1	foo
2	1	bar

I'm not concerned which version of the email address I end up with (i.e. doesn't have to be the one that's all lowercase).

What's the best way to get this done?

CodePudding user response：

You can update the posts table by taking the smallest id:

update posts p
  set userid = u.user_id
from (
  select min(id) user_id, array_agg(id) as user_ids
  from users u
  group by lower(email)
  having count(*) > 1
) u
where p.userid = any(u.user_ids)
  and p.userid <> u.user_id
;

The SELECT in the derived table returns all users that have more than one email address. The WHERE clause then updates the posts table to use one of the IDs. Once that is done, you can delete the no longer used users

delete from users
where not exists (select * 
                  from posts p 
                  where users.id = p.userid);

Online example

CodePudding user response：

The key to deduplicating rows is breaking ties by some kind of row ID, which you already have. We're going to keep the user with the lowest ID for each case-insensitive email, keeping in mind that there may be more than 2 duplicates for some.

First, for each post, set the user to any user with an equivalent email, for which there exists no other user also with an equivalent email but a lower ID. If I'm doing this right, that should match exactly one user row every time, either the original user or another one.

UPDATE posts p SET "userId" = u2.id
FROM users u, users u2
WHERE u.id = p."userId"
AND lower(u2.email) = lower(u.email)
AND NOT EXISTS (
  SELECT 1
  FROM users u3
  WHERE u3.id < u2.id
  AND lower(u3.email) = lower(u2.email)
);

(You could also do this with a MIN or DISTINCT subquery, but in my experience this is the fastest.)

Then delete any users for which there exists a user with an equivalent email and a lower ID:

DELETE FROM users u
WHERE EXISTS (
  SELECT 1 FROM users u2
  WHERE u2.id < u.id
  AND lower(u2.email) = lower(u.email)
);

Optionally, seal the deal with a uniqueness constraint on lower-case email. I don't remember the exact syntax, but this should be close:

CREATE INDEX user_lower_email ON users(lower(email));