Home > Back-end >  SQL: Deleting Duplicates using Not in and Group by
SQL: Deleting Duplicates using Not in and Group by

Time:09-14

I have the following SQL Syntax to delete duplicate rows, but never are any rows affected.

DELETE FROM content_stacks WHERE id NOT IN (
SELECT id 
FROM content_stacks
GROUP BY user_id, content_id
);

The subquery itself is returning the id list of first entries correctly.

SELECT id 
FROM content_stacks
GROUP BY user_id, content_id

When I'm inserting the results list as a string it is working, too:

DELETE FROM content_stacks WHERE id NOT IN (239,231,217,218,219,232,233,220,230,226,234,235,224,225,221,223,222,227,228,229,236,237,238,216,208,209,210,204,211,212,242,203,240,201,241,205,206,207,213,214,215);

I checked many similar examples and this should be working in my opinion. What am I missing?

CodePudding user response:

First find first rows using ROW_NUMBER Then delete record with row number greater than 1:

WITH CTE  AS (
SELECT id , ROW_NUMBER() OVER(PARTITION BY user_id, content_id, ORDER BY id) rn 
FROM content_stacks
)
DELETE cs 
FROM content_stacks cs
INNER JOIN CTE ON CTE.id = cs.id
WHERE rn > 1

CodePudding user response:

Am sorry to ask but if your deleting why would u need to group the records. Are not just increasing the runtime.

CodePudding user response:

The code from Meyssam Toluie is not working as it is but I made a similar solution with the same idea with rownumbers:

DELETE FROM content_stacks WHERE id IN
(SELECT id FROM (
      SELECT id, ROW_NUMBER() OVER(PARTITION BY user_id, content_id)row_num
      FROM content_stacks
    ) sub
WHERE row_num > 1)

This is working for me now.

My first command did not work because: The group by command does not show all ids in the output, but they are still there, so in fact all ids were returned in the NOT IN id-list. The row number seems to be the easiest way for this problem.

  • Related