I have table as shown below. It has 3 columns user
, page
and timestamp
.
------ ------------------ -----------
| user | page | timestamp |
------ ------------------ -----------
| 1 | homepage?c=1234 | 1234 |
| 1 | homepage?c=1234 | 1245 |
| 1 | homepage?c=1234 | 1260 |
| 1 | homepage?c=1234 | 1280 |
| 1 | Signup?=1233 | 1293 |
| 1 | Signup?=121asd | 1303 |
| 1 | Signup?=212 | 1317 |
| 1 | Signup?123213 | 1337 |
| 1 | homepage?c=1234 | 1357 |
| 1 | Hotels | 1370 |
| 1 | Hotels | 1384 |
| 1 | Hotels | 1398 |
| 1 | Signup?=121asd | 1413 |
| 1 | Signup?=121as123 | 1433 |
| 1 | homepage?c=1234 | 1447 |
| 1 | homepage?c=1234 | 1463 |
| 1 | homepage?c=1234 | 1482 |
| 1 | homepage?c=1234 | 1496 |
------ ------------------ -----------
In the above table I want to calculate number of times a user visits a particular page. Each record in the above table is a hit, so just grouping by column page
wont help.
I essentially want to count continuous records for a particular page only once. I only want to increment the counter for a particular page when user visits some other page in between.
Something as shown below:
The user visits homepage 3 times (yellow color), Signup page 2 times (blue color) and Hotels page 1 time (orange in color).
The output expected is as below:
------- ---------- -------- --------
| | Homepage | Signup | Hotels |
------- ---------- -------- --------
| users | 3 | 2 | 1 |
------- ---------- -------- --------
CodePudding user response:
So what you want to do is first get the page name out of the url, so homepage?c=1234 becomes homepage. Then for each visit you can use LAG
to see if the page changed between and use a window SUM
function to sum up all the changes up until this point. Each of the page visit groups will then get a unique grouping number, then it's just a matter of counting the distinct grouping numbers for each of them.
I've split this into individual CTEs for clarity and understanding:
WITH pagesplit AS
(
SELECT user
, (split(page, '?')[safe_ordinal(1)]) AS page
, ts
FROM so.visits
),
page_with_prev AS
(
SELECT user
, page
, COALESCE(LAG(_page) OVER (PARTITION BY user ORDER BY _ts), '?') AS prev_page
, ts
FROM pagesplit
),
count_consecutive AS
(
SELECT user
, page
, SUM(CASE WHEN page != prev_page THEN 1 ELSE NULL END) OVER (PARTITION BY user ORDER BY ts) AS grouping_no
FROM page_with_prev
)
SELECT user
, page
, COUNT(DISTINCT grouping_no) AS visit_count
FROM count_streaks
GROUP BY user, page
Pivoting the result into columns may not make much sense unless you know exactly how many you will have and that no new ones will ever be added. You can do that if you want of course.
CodePudding user response:
Consider also below options
Flatten output
select user, page, count(distinct visit_number) visits
from (
select *, countif(new_visit) over(partition by user order by timestamp) visit_number
from (
select *, page != lag(page, 1, '') over(partition by user order by timestamp) new_visit
from (
select user, split(page, '?')[offset(0)] page, timestamp
from your_table
)
)
)
group by user, page
with result
Or with Pivot
select * from (
select user, page, countif(new_visit) over(partition by user order by timestamp) visit_number
from (
select *, page != lag(page, 1, '') over(partition by user order by timestamp) new_visit
from (
select user, split(page, '?')[offset(0)] page, timestamp
from your_table
)
)
)
pivot (count(distinct visit_number) for lower(page) in ('homepage', 'signup', 'hotels'))
with output