Home > Blockchain >  Snowflake - Remove duplicate rows based on conditions met
Snowflake - Remove duplicate rows based on conditions met

Time:03-17

I have the below table that has duplicate start and end for the same id and same file (because I accidentally loaded the files twice from s3). So there should be only 1 filename for each id / start / end

when I run this query in my table:

SELECT filename, id, start, end from table where id = '262627';

I see the below:

filename id start end click_total
name_2022/01/01.csv 262627 2022-01-01 2022-01-02 142
name_2022/01/01.csv 262627 2022-01-01 2022-01-02 142
name_2022/01/02.csv 262627 2022-01-02 2022-01-03 753
name_2022/01/02.csv 262627 2022-01-02 2022-01-03 753

but it should look like this:

filename id start end click_total
name_2022/01/01.csv 262627 2022-01-01 2022-01-02 142
name_2022/01/02.csv 262627 2022-01-02 2022-01-03 753

Is there a query I can run to remove the duplicate rows, without hardcoding each id in the statement?

CodePudding user response:

you can use row_number and pickup only where row_number =1 and delete other rows.

select * from (
    select column1 as filename,column2 as id, column3 as start1, column4 as end1,
    row_number()   over (partition by filename, id order by filename, id) as rnum
    from values
    (('name_2022/01/01.csv')    ,('262627'), ('2022-01-01'),('2022-01-02')),
    (('name_2022/01/01.csv')    ,('262627'), ('2022-01-01'),('2022-01-02')),
    (('name_2022/01/02.csv')    ,('262627'), ('2022-01-02'),('2022-01-03')),
    (('name_2022/01/02.csv')    ,('262627'), ('2022-01-02'),('2022-01-03'))
)  where rnum = 1
  ;

CodePudding user response:

please find the example where in creating a intermediate table to store the distinct rows.

create or replace table tab1 as (select *
from values
    (('name_2022/01/01.csv')    ,('262627'), ('2022-01-01'),('2022-01-02')),
    (('name_2022/01/01.csv')    ,('262627'), ('2022-01-01'),('2022-01-02')),
    (('name_2022/01/02.csv')    ,('262627'), ('2022-01-02'),('2022-01-03')),
    (('name_2022/01/02.csv')    ,('262627'), ('2022-01-02'),('2022-01-03'))
);                                 


-- create intermediate table using analytical functions to remove duplicate rows.
create table uniq_tab1 as  (
select column1 as filename,column2 as id, column3 as start1, column4 as end1
from tab1 
qualify row_number()   over (partition by filename, id order by filename, id) = 1
  );
  
  select * from uniq_tab1;

CodePudding user response:

If all the fields are duplicated, easiest way would be to replace your table with its deduped self

create or replace table your_table as 
select distinct * 
from your_table;
  • Related