Home > Net >  Remove duplicates based on multiple columns and datetime
Remove duplicates based on multiple columns and datetime

Time:11-01

I want to remove duplicated rows that has same visitor_id based on the earlier datetime. For example, for visitor_id 2643331144, I want to pick row 1 as it has the earlier visit date time, and also keep channel and visit_page for the same row. And for visitor_id 1092581226, I want to keep row 3.

rowno visitor_id datetime channel visit_page
1 2643331144 10/3/2021 4:05:29 PM email landing page
2 2643331144 10/3/2021 4:05:39 PM organic search landing page
3 1092581226 10/7/2021 1:08:12 PM email price reduced
4 1092581226 10/7/2021 1:08:44 PM organic search landing page
5 1092581226 10/7/2021 1:09:04 PM paid search unknow
6 1092581226 10/7/2021 1:09:05 PM email price reduced

And I want a result look like below:

rowno visitor_id datetime channel visit_page
1 2643331144 10/3/2021 4:05:29 PM email landing page
2 1092581226 10/7/2021 1:08:12 PM email price reduced

I used below query but the total visitor number is over-deduped. But without using partition, total number will be double counted as same visitor has multiple channels and pages during same session.

with T as
(select *, row_number() over (partition by visitor_id order by datetime asc) as rank
from table A)

select distinct visitor_id, channel, visit_page
from T
where rank=1

CodePudding user response:

If the only problem is rownum in final output you can "recount" it with row_number() over (order by datetime asc) as rownum in final select:

with cte (
   visitor_id 
  ,datetime   
  ,channel    
  ,visit_page 
) as (
    values 
     (2643331144,'10/3/2021 4:05:29 PM','email','landing page'),
 (2643331144,'10/3/2021 4:05:39 PM','organic search','landing page'),
 (1092581226,'10/7/2021 1:08:12 PM','email','price reduced'),
 (1092581226,'10/7/2021 1:08:44 PM','organic search','landing page'),
 (1092581226,'10/7/2021 1:09:04 PM','paid search','unknow'),
 (1092581226,'10/7/2021 1:09:05 PM','email','price reduced')
)

select row_number() over (order by datetime asc) as rownum,
    visitor_id,
    datetime,
    channel,
    visit_page
from (
        -- inlined your WITH clause into subquery
        select *,
            row_number() over (
                partition by visitor_id
                order by datetime asc
            ) as rank
        from cte
    )
where rank = 1

Output:

rownum visitor_id datetime channel visit_page
1 2643331144 10/3/2021 4:05:29 PM email landing page
2 1092581226 10/7/2021 1:08:12 PM email price reduced
  • Related