I'm trying to remove duplicates from a table. Essentially I have a table like:
TEST_TABLE
| User  | Value1 | Value2 | Value3 |
|-------|--------|--------|--------|
| user1 | a      | b      | c      |
| user2 | d      | e      | f      |
| user1 | a      | b      | c      |
| user3 | a      | b      | c      |
| ...   | ...    | ...    | ...    |
so on and so forth, where user1 will have repeated rows, but user2 and 3 don't have any duplicates.
I am able to see all of the duplicates by doing something like:

```sql
SELECT User, Value1, Value2, Value3, COUNT(*) AS count_num_rows
FROM TEST_TABLE
GROUP BY User, Value1, Value2, Value3
HAVING COUNT(*) > 1
```
But I'm not sure how to take it from there. The table doesn't have a good primary key for me to use, so my original inclination was to add an ID column to everything and then, using the query above, delete the row with MAX(ID) from each duplicate group and drop the ID column afterwards. However, that seems like a significant amount of computation, and this is a very large table, so I'd like to do it more efficiently. Is there a better way?
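To make the detect-duplicates step above concrete, here is a small self-contained sketch of the same `GROUP BY` / `HAVING` query, run in Python against an in-memory SQLite database seeded with the sample rows from the table above (the column and table names mirror the question; the data is the invented example, not real):

```python
import sqlite3

# In-memory database with sample rows mirroring the table in the question.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test_table (user TEXT, value1 TEXT, value2 TEXT, value3 TEXT)")
con.executemany(
    "INSERT INTO test_table VALUES (?, ?, ?, ?)",
    [("user1", "a", "b", "c"),
     ("user2", "d", "e", "f"),
     ("user1", "a", "b", "c"),
     ("user3", "a", "b", "c")],
)

# Grouping on every column surfaces fully duplicated rows and their counts.
dupes = con.execute("""
    SELECT user, value1, value2, value3, COUNT(*) AS count_num_rows
    FROM test_table
    GROUP BY user, value1, value2, value3
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [('user1', 'a', 'b', 'c', 2)]
```

Only `user1`'s repeated row shows up, matching the question's description.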
CodePudding user response:
Using `QUALIFY`:

```sql
--CREATE OR REPLACE TABLE TEST_TABLE AS
SELECT User, Value1, Value2, Value3
FROM TEST_TABLE
QUALIFY ROW_NUMBER() OVER (PARTITION BY User ORDER BY ...) = 1
```

where:

- `PARTITION BY` — the key columns that define a duplicate,
- `ORDER BY` — the logic for choosing which row to preserve.
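`QUALIFY` is Snowflake syntax, but the same `ROW_NUMBER()` idea can be emulated anywhere with a subquery. A sketch in Python with SQLite and the invented sample data from the question (the `ORDER BY value1` inside `OVER()` is an arbitrary stand-in for the "..." above — it is the "which row wins" choice):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test_table (user TEXT, value1 TEXT, value2 TEXT, value3 TEXT)")
con.executemany(
    "INSERT INTO test_table VALUES (?, ?, ?, ?)",
    [("user1", "a", "b", "c"),
     ("user2", "d", "e", "f"),
     ("user1", "a", "b", "c"),
     ("user3", "a", "b", "c")],
)

# SQLite has no QUALIFY, so the window function moves into a subquery
# and the rn = 1 filter into an outer WHERE clause.
rows = con.execute("""
    SELECT user, value1, value2, value3
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY user ORDER BY value1
               ) AS rn
        FROM test_table
    )
    WHERE rn = 1
    ORDER BY user
""").fetchall()
print(rows)
# [('user1', 'a', 'b', 'c'), ('user2', 'd', 'e', 'f'), ('user3', 'a', 'b', 'c')]
```

One row survives per user, which deduplicates `user1` without ever materializing an ID column.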
CodePudding user response:
Or also:

```sql
ALTER TABLE test_table RENAME TO test_table_dropme;
CREATE TABLE test_table LIKE test_table_dropme;
INSERT INTO test_table SELECT DISTINCT * FROM test_table_dropme;
-- check the contents of the new "test_table" ...
-- check the grants active on "test_table_dropme" and re-apply them
-- to "test_table" ...
DROP TABLE test_table_dropme;