I want to remove duplicated rows that has same visitor_id based on the earlier datetime. For example, for visitor_id 2643331144, I want to pick row 1 as it has the earlier visit date time, and also keep channel and visit_page for the same row. And for visitor_id 1092581226, I want to keep row 3.
rowno | visitor_id | datetime | channel | visit_page |
---|---|---|---|---|
1 | 2643331144 | 10/3/2021 4:05:29 PM | landing page | |
2 | 2643331144 | 10/3/2021 4:05:39 PM | organic search | landing page |
3 | 1092581226 | 10/7/2021 1:08:12 PM | price reduced | |
4 | 1092581226 | 10/7/2021 1:08:44 PM | organic search | landing page |
5 | 1092581226 | 10/7/2021 1:09:04 PM | paid search | unknow |
6 | 1092581226 | 10/7/2021 1:09:05 PM | price reduced |
And I want a result look like below:
rowno | visitor_id | datetime | channel | visit_page |
---|---|---|---|---|
1 | 2643331144 | 10/3/2021 4:05:29 PM | landing page | |
2 | 1092581226 | 10/7/2021 1:08:12 PM | price reduced |
I used below query but the total visitor number is over-deduped. But without using partition, total number will be double counted as same visitor has multiple channels and pages during same session.
with T as
(select *, row_number() over (partition by visitor_id order by datetime asc) as rank
from table A)
select distinct visitor_id, channel, visit_page
from T
where rank=1
CodePudding user response:
If the only problem is rownum in final output you can "recount" it with row_number() over (order by datetime asc) as rownum
in final select:
with cte (
visitor_id
,datetime
,channel
,visit_page
) as (
values
(2643331144,'10/3/2021 4:05:29 PM','email','landing page'),
(2643331144,'10/3/2021 4:05:39 PM','organic search','landing page'),
(1092581226,'10/7/2021 1:08:12 PM','email','price reduced'),
(1092581226,'10/7/2021 1:08:44 PM','organic search','landing page'),
(1092581226,'10/7/2021 1:09:04 PM','paid search','unknow'),
(1092581226,'10/7/2021 1:09:05 PM','email','price reduced')
)
select row_number() over (order by datetime asc) as rownum,
visitor_id,
datetime,
channel,
visit_page
from (
-- inlined your WITH clause into subquery
select *,
row_number() over (
partition by visitor_id
order by datetime asc
) as rank
from cte
)
where rank = 1
Output:
rownum | visitor_id | datetime | channel | visit_page |
---|---|---|---|---|
1 | 2643331144 | 10/3/2021 4:05:29 PM | landing page | |
2 | 1092581226 | 10/7/2021 1:08:12 PM | price reduced |