I have the below table that has duplicate start and end for the same id and same file (because I accidentally loaded the files twice from s3). So there should be only 1 filename for each id / start / end
when I run this query in my table:
SELECT filename, id, start, end from table where id = '262627';
I see the below:
filename | id | start | end | click_total |
---|---|---|---|---|
name_2022/01/01.csv | 262627 | 2022-01-01 | 2022-01-02 | 142 |
name_2022/01/01.csv | 262627 | 2022-01-01 | 2022-01-02 | 142 |
name_2022/01/02.csv | 262627 | 2022-01-02 | 2022-01-03 | 753 |
name_2022/01/02.csv | 262627 | 2022-01-02 | 2022-01-03 | 753 |
but it should look like this:
filename | id | start | end | click_total |
---|---|---|---|---|
name_2022/01/01.csv | 262627 | 2022-01-01 | 2022-01-02 | 142 |
name_2022/01/02.csv | 262627 | 2022-01-02 | 2022-01-03 | 753 |
Is there a query I can run to remove the duplicate rows, without hardcoding each id in the statement?
CodePudding user response:
you can use row_number and pickup only where row_number =1 and delete other rows.
select * from (
select column1 as filename,column2 as id, column3 as start1, column4 as end1,
row_number() over (partition by filename, id order by filename, id) as rnum
from values
(('name_2022/01/01.csv') ,('262627'), ('2022-01-01'),('2022-01-02')),
(('name_2022/01/01.csv') ,('262627'), ('2022-01-01'),('2022-01-02')),
(('name_2022/01/02.csv') ,('262627'), ('2022-01-02'),('2022-01-03')),
(('name_2022/01/02.csv') ,('262627'), ('2022-01-02'),('2022-01-03'))
) where rnum = 1
;
CodePudding user response:
please find the example where in creating a intermediate table to store the distinct rows.
create or replace table tab1 as (select *
from values
(('name_2022/01/01.csv') ,('262627'), ('2022-01-01'),('2022-01-02')),
(('name_2022/01/01.csv') ,('262627'), ('2022-01-01'),('2022-01-02')),
(('name_2022/01/02.csv') ,('262627'), ('2022-01-02'),('2022-01-03')),
(('name_2022/01/02.csv') ,('262627'), ('2022-01-02'),('2022-01-03'))
);
-- create intermediate table using analytical functions to remove duplicate rows.
create table uniq_tab1 as (
select column1 as filename,column2 as id, column3 as start1, column4 as end1
from tab1
qualify row_number() over (partition by filename, id order by filename, id) = 1
);
select * from uniq_tab1;
CodePudding user response:
If all the fields are duplicated, easiest way would be to replace your table with its deduped self
create or replace table your_table as
select distinct *
from your_table;