Session/User ID | Time Stamp | Page |
---|---|---|
101 | dd - mm - yy 01:00:05 | Page A |
101 | dd - mm - yy 01:00:10 | Page B |
101 | dd - mm - yy 01:00:35 | Page C |
102 | dd - mm - yy 02:00:10 | Page B |
102 | dd - mm - yy 02:00:20 | Page C |
103 | dd - mm - yy 02:00:35 | Page A |
104 | dd - mm - yy 03:00:40 | Page B |
104 | dd - mm - yy 03:00:45 | Page C |
I have a question similar to one asked here: Constructing User Journey - How do you 'self, loop' join? I want to build a path for each session ID, ordered by timestamp. I would also like to count how many sessions/users went through the same path.
I would like an outcome like this:
How many users followed the same path:
Path | Frequency |
---|---|
Page A - Page B - Page C | 1 |
Page B - Page C | 2 |
Page A | 1 |
An idea of which user followed what path:
Session/User ID | Path |
---|---|
101 | Page A - Page B - Page C |
102 | Page B - Page C |
103 | Page A |
104 | Page B - Page C |
I would really appreciate some help. Thank you.
CodePudding user response:
You may try
df <- read.table(text = "Session_UserID TimeStamp Page
101 'dd - mm - yy 01:00:05' 'Page A'
101 'dd - mm - yy 01:00:10' 'Page B'
101 'dd - mm - yy 01:00:35' 'Page C'
102 'dd - mm - yy 02:00:10' 'Page B'
102 'dd - mm - yy 02:00:20' 'Page C'
103 'dd - mm - yy 02:00:35' 'Page A'
104 'dd - mm - yy 03:00:40' 'Page B'
104 'dd - mm - yy 03:00:45' 'Page C'", header = TRUE)
library(dplyr)
# One row per session: collapse its pages into a single path string
df %>%
  group_by(Session_UserID) %>%
  summarize(path = paste(Page, collapse = "-"))
Session_UserID path
<int> <chr>
1 101 Page A-Page B-Page C
2 102 Page B-Page C
3 103 Page A
4 104 Page B-Page C
# Count how many sessions share each path
df %>%
  group_by(Session_UserID) %>%
  summarize(path = paste(Page, collapse = "-")) %>%
  group_by(path) %>%
  summarize(Frequency = n())
path Frequency
<chr> <int>
1 Page A 1
2 Page A-Page B-Page C 1
3 Page B-Page C 2
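The pipeline above relies on the rows already being in chronological order. As a rough sketch of the same idea in base R (no packages) with an explicit sort, the following may help. The data frame `visits`, its column names, and the "YYYY-MM-DD HH:MM:SS" timestamp format are assumptions for illustration, since the question only shows placeholder dates; the rows are deliberately shuffled.

```r
# Hypothetical sample data (not from the question), deliberately out of order
visits <- data.frame(
  id   = c(101, 102, 101, 101, 102, 103, 104, 104),
  time = c("2022-06-14 01:00:35", "2022-06-14 02:00:20",
           "2022-06-14 01:00:05", "2022-06-14 01:00:10",
           "2022-06-14 02:00:10", "2022-06-14 02:00:35",
           "2022-06-14 03:00:45", "2022-06-14 03:00:40"),
  page = c("Page C", "Page C", "Page A", "Page B",
           "Page B", "Page A", "Page C", "Page B"),
  stringsAsFactors = FALSE
)

# Parse the timestamps (assumed format) and sort by session, then by time
visits$time <- as.POSIXct(visits$time, tz = "UTC")
visits <- visits[order(visits$id, visits$time), ]

# One path string per session; the result is named by session ID
paths <- tapply(visits$page, visits$id, paste, collapse = " - ")

# Which session followed which path
session_paths <- data.frame(id = names(paths), Path = as.vector(paths))

# How many sessions followed each path
path_freq <- as.data.frame(table(Path = paths),
                           responseName = "Frequency",
                           stringsAsFactors = FALSE)
```

Here `session_paths` corresponds to the per-session table in the question and `path_freq` to the frequency table.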
CodePudding user response:
I also sorted by time, in case the data is not already in chronological order:
library(tidyverse)
data <- tibble::tribble(
~id, ~time, ~page,
101L, "2022-06-14 01:00:05", "Page A",
101L, "2022-06-14 01:00:10", "Page B",
101L, "2022-06-14 01:00:35", "Page C",
102L, "2022-06-14 02:00:10", "Page B",
102L, "2022-06-14 02:00:20", "Page C",
103L, "2022-06-14 02:00:35", "Page A",
104L, "2022-06-14 03:00:40", "Page B",
104L, "2022-06-14 03:00:45", "Page C"
)
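data %>%
  type_convert() %>%       # parse the time strings as date-times
  arrange(id, time) %>%    # chronological order within each session
  group_by(id) %>%
  summarise(
    path = paste0(page, collapse = "-")
  ) %>%
  count(path)              # number of sessions per path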
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> time = col_datetime(format = ""),
#> page = col_character()
#> )
#> # A tibble: 3 × 2
#> path n
#> <chr> <int>
#> 1 Page A 1
#> 2 Page A-Page B-Page C 1
#> 3 Page B-Page C 2
Created on 2022-06-14 by the reprex package (v2.0.0)
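To get both tables from the question in the requested "Page A - Page B - Page C" format, one possible variation with dplyr is sketched below. The object `views`, its column names, and the concrete dates are made up for illustration; `count()` is used with its `name` and `sort` arguments so the most common path comes first.

```r
library(dplyr)

# Hypothetical sample data (not from the question), deliberately out of order
views <- tibble(
  id   = c(101L, 102L, 101L, 101L, 102L, 103L, 104L, 104L),
  time = as.POSIXct(c("2022-06-14 01:00:35", "2022-06-14 02:00:20",
                      "2022-06-14 01:00:05", "2022-06-14 01:00:10",
                      "2022-06-14 02:00:10", "2022-06-14 02:00:35",
                      "2022-06-14 03:00:45", "2022-06-14 03:00:40"),
                    tz = "UTC"),
  page = c("Page C", "Page C", "Page A", "Page B",
           "Page B", "Page A", "Page C", "Page B")
)

# Which session followed which path
session_paths <- views %>%
  arrange(id, time) %>%
  group_by(id) %>%
  summarise(Path = paste(page, collapse = " - "), .groups = "drop")

# How many sessions followed each path, most frequent first
path_freq <- session_paths %>%
  count(Path, name = "Frequency", sort = TRUE)
```

Keeping `session_paths` as its own object means the per-session table and the frequency table come from a single pass over the data.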