Home > Blockchain >  more efficient way to select duplicate users
more efficient way to select duplicate users

Time:01-12

Im trying to select * from all duplicate rows in users, where a duplicate is defined as two users sharing the same first_name and last_name. (I need to process the other columns that might differ)

Im using MySQL 8.0.28.

My first try was to literally translate my requirement:

select * from `users` AS u1 where exists (select 1 from `users` AS u2 WHERE `u2`.`first_name` = `u1`.`first_name` AND `u2`.`last_name` = `u1`.`last_name` AND `u2`.`id` != `u1`.`id`)

Which, obviously, has a horrendous execution time.

My current query is

SELECT * from users where  Concat(first_name," ",last_name) IN (select Concat(first_name," ",last_name) from `users` GROUP BY first_name, last_name HAVING COUNT(*)>1)

which is vastly more efficient, but still takes more than 100ms for 8000 records. I suppose a solution that doesn't use concat could benefit from indicies and would not need to calculate the result for each row.

Also, I couldn't get group by to work because I need so select all columns of all rows that are duplicates, not just the distinct first_name's and last_name's. Also because I don't want to disable ONLY_FULL_GROUP_BY (not sure if disabling that would help anyway).

Is there a more efficient, proper way to select these duplicate rows?

CodePudding user response:

I would just use an aggregation approach here:

SELECT *
FROM users
WHERE (first_name, last_name) IN (
    SELECT first_name, last_name
    FROM users
    GROUP BY 1, 2
    HAVING COUNT(*) > 1
);

On MySQL 8 , we can also use COUNT() as an analytic function here:

WITH cte AS (
    SELECT *, COUNT(*) OVER (PARTITION BY first_name, last_name) AS cnt
    FROM users
)

SELECT *
FROM cte
WHERE cnt > 1;
  • Related