Removing duplicates in AWS Redshift using SQL


Summary: I've got a table which looks something like this

Entity Id | Entity Name
----------|------------
01        | Entity1
02        | Entity2
01        | Entity1
03        | Entity3

Question: I'm trying to remove the duplicates using a SQL procedure but am struggling to find a solution. Is there a way to remove duplicates in this case using just SQL? Ideally, I'd like to create a procedure which I can call whenever I need to remove duplicates.

I tried using a CTE and row_number() but am struggling to get it working...

CodePudding user response:

If I read your data correctly, a simple DISTINCT select should work here:

SELECT DISTINCT EntityId, EntityName
FROM yourTable;
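Note that SELECT DISTINCT only de-duplicates the result set; it does not change the stored table. If you want to materialize the de-duplicated data, one option (a sketch, assuming the table is called yourTable with the two columns shown, and using a hypothetical name yourTable_dedup for the new table) is to write the distinct rows into a new table and swap it in once you have verified it:

-- Create a new table holding only the distinct rows
CREATE TABLE yourTable_dedup AS
SELECT DISTINCT EntityId, EntityName
FROM yourTable;

-- After checking the result, replace the original table
-- DROP TABLE yourTable;
-- ALTER TABLE yourTable_dedup RENAME TO yourTable;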

CodePudding user response:

You're on the right track with row_number(), but it looks like you are trying to delete all but one of each set of duplicated rows. That isn't possible directly, since there is no way to tell one copy from another. You will need to do this in a few steps:

  1. Save a single copy of each duplicated row into a temp table. Using row_number() partitioned by all columns and keeping the rows where the value equals 2 is a good way to find them.
  2. Delete all matching rows from the original table.
  3. Insert the values from the temp table back into the original table.

You can do this in a transaction block so that no other user of the database will see the table with the rows missing.
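Putting those steps together, here is a minimal sketch, assuming the table is called yourTable with columns EntityId and EntityName, and that two rows count as duplicates when both columns match (adjust the names and the PARTITION BY list to your schema):

BEGIN;

-- 1. Keep one copy of each duplicated row in a temp table.
--    Within each partition all rows are identical, so the row numbered 2
--    exists only for groups that actually have duplicates.
CREATE TEMP TABLE dup_rows AS
SELECT EntityId, EntityName
FROM (
    SELECT EntityId, EntityName,
           ROW_NUMBER() OVER (PARTITION BY EntityId, EntityName ORDER BY EntityId) AS rn
    FROM yourTable
) ranked
WHERE rn = 2;

-- 2. Delete every row that has at least one duplicate.
DELETE FROM yourTable
USING dup_rows
WHERE yourTable.EntityId = dup_rows.EntityId
  AND yourTable.EntityName = dup_rows.EntityName;

-- 3. Re-insert a single copy of each of those rows.
INSERT INTO yourTable
SELECT EntityId, EntityName FROM dup_rows;

COMMIT;

To make this callable on demand, the same statements can be wrapped in a Redshift stored procedure (CREATE OR REPLACE PROCEDURE ... LANGUAGE plpgsql) and run with CALL when required.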
