I have the below dataset, and I am trying to create a more meaningful path.
Row# | Session | Click | Page |
---|---|---|---|
1 | 123 | Enter | Pg1 |
2 | 123 | phpbutton | Pg1 |
3 | 123 | Enter | Pg2 |
4 | 123 | Enter | Pg3 |
5 | 123 | Form1 | Pg3 |
6 | 123 | Form2 | Pg3 |
7 | 123 | Form1 | Pg3 |
8 | 123 | Form1 | Pg3 |
9 | 123 | abcbutton | Pg3 |
10 | 123 | Enter | Pg1 |
11 | 123 | xyzselect | Pg1 |
12 | 123 | Enter | Pg4 |
13 | 123 | Enter | Pg3 |
14 | 123 | Back | Pg3 |
15 | 123 | Enter | Pg1 |
I would like the outcome to look this:
Session | Activity |
---|---|
123 | Pg1 |
123 | phpbutton |
123 | Pg2 |
123 | Pg3 |
123 | Form1 |
123 | Form2 |
123 | Form1 |
123 | abcbutton |
123 | Pg1 |
123 | xyzselect |
123 | Pg4 |
123 | Pg3 |
123 | Back |
123 | Pg1 |
If the Click column has Enter, then the Activity column should show the Page. But, if the subsequent page is equal to the previous page, then the Activity column should show the value from the Click column. For instance, row# 1 and 2 have the same Page numbers, so I would like the Activity column to show, Pg1, then, phpbutton. But, if the Click column has two or more subsequent same values, as seen in Row# 7 and 8, I would like the Activity column to show just one entry of Form 1.
Thanks a lot.
CodePudding user response:
Try this
df |> group_by(Session) |>
mutate(Activity = case_when(Click == "Enter" ~ Page ,
lag(Page) == Page ~ Click)) |> select(Session , Activity)
CodePudding user response:
library(dplyr)
dat$activity <- ifelse(dat$click == "Enter" & (lag(dat$page) != dat$page | is.na(lag(dat$page))), dat$page,
ifelse(lag(dat$page) == dat$page, dat$click, NA))
row session click page activity
1 1 123 Enter Pg1 Pg1
2 2 123 phpbutton Pg1 phpbutton
3 3 123 Enter Pg2 Pg2
4 4 123 Enter Pg3 Pg3
5 5 123 Form1 Pg3 Form1
6 6 123 Form2 Pg3 Form2
7 7 123 Form1 Pg3 Form1
8 8 123 Form1 Pg3 Form1
9 9 123 abcbutton Pg3 abcbutton
10 10 123 Enter Pg1 Pg1
11 11 123 xyzselect Pg1 xyzselect
12 12 123 Enter Pg4 Pg4
13 13 123 Enter Pg3 Pg3
14 14 123 Back Pg3 Back
15 15 123 Enter Pg1 Pg1
You could remove duplicate, consecutive rows like this:
dat[cumsum(rle(paste0(dat$session, dat$click, dat$page, dat$activity))$length),]
row session click page activity
1 1 123 Enter Pg1 Pg1
2 2 123 phpbutton Pg1 phpbutton
3 3 123 Enter Pg2 Pg2
4 4 123 Enter Pg3 Pg3
5 5 123 Form1 Pg3 Form1
6 6 123 Form2 Pg3 Form2
8 8 123 Form1 Pg3 Form1
9 9 123 abcbutton Pg3 abcbutton
10 10 123 Enter Pg1 Pg1
11 11 123 xyzselect Pg1 xyzselect
12 12 123 Enter Pg4 Pg4
13 13 123 Enter Pg3 Pg3
14 14 123 Back Pg3 Back
15 15 123 Enter Pg1 Pg1
data:
structure(list(row = 1:15, session = c(123L, 123L, 123L, 123L,
123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L, 123L
), click = c("Enter", "phpbutton", "Enter", "Enter", "Form1",
"Form2", "Form1", "Form1", "abcbutton", "Enter", "xyzselect",
"Enter", "Enter", "Back", "Enter"), page = c("Pg1", "Pg1", "Pg2",
"Pg3", "Pg3", "Pg3", "Pg3", "Pg3", "Pg3", "Pg1", "Pg1", "Pg4",
"Pg3", "Pg3", "Pg1")), row.names = c(NA, -15L), class = "data.frame")