The following is a sample dataset:
df <- structure(list(
sess_id = c(189, 189, 189, 189, 189, 189, 189, 189, 189, 189, 124, 124,124,124,124),
Activity = c("home", "pg1", "pg2", "cart", "pg3", "pg2", "home", "pg3", "cart","pg1","home","pg2", "pg3", "cart", "pg2"),
ts = c("2002-06-09 12:45:40","2002-06-09 12:46:01","2002-06-09 12:46:30","2002-06-09 12:47:00","2002-06-09 12:47:50", "2002-06-09 12:49:51", "2002-06-09 12:49:59", "2002-06-09 13:00:00", "2002-06-09 13:30:00", "2002-06-09 13:31:02", "2002-06-09 13:31:45", "2002-06-09 13:32:28", "2002-06-09 13:32:30", "2002-06-09 13:32:32", "2002-06-09 13:33:28")),
.Names = c("sess_id", "Activity", "ts"),
row.names = c(NA, -15L),
class = "data.frame")
I am trying to analyze the sequence in blocks. I want to extract a block of user behavior where the start point is 'home' and the end point is 'cart'. For instance, for sess_id 189, I would have two blocks: home, pg1, pg2, cart and home, pg3, cart.
My idea is to create a new column called index where a 1 denotes part of a block. If 0, I can filter out those rows as not being part of the home to cart sequence.
With the following code I am able to mark the start and end points, but not the Activity between home and cart.
df %>%
group_by(sess_id) %>%
arrange(ts) %>%
mutate(index = case_when(Activity == "home" | Activity == "cart" ~ 1, TRUE ~ 0)) %>%
mutate(index = as.numeric(index)) %>%
ungroup()%>%
mutate(block_index = cumsum(index))
I am expecting an output something like this:
sess_id | Activity | ts | index |
---|---|---|---|
189 | home | 2002-06-09 12:45:40 | 1 |
189 | pg1 | 2002-06-09 12:46:01 | 1 |
189 | pg2 | 2002-06-09 12:46:30 | 1 |
189 | cart | 2002-06-09 12:47:00 | 1 |
189 | pg3 | 2002-06-09 12:47:50 | 0 |
189 | pg2 | 2002-06-09 12:49:51 | 0 |
189 | home | 2002-06-09 12:49:59 | 1 |
189 | pg3 | 2002-06-09 13:00:00 | 1 |
189 | cart | 2002-06-09 13:30:00 | 1 |
189 | pg1 | 2002-06-09 13:31:02 | 0 |
124 | home | 2002-06-09 13:31:45 | 1 |
124 | pg2 | 2002-06-09 13:32:28 | 1 |
124 | pg3 | 2002-06-09 13:32:30 | 1 |
124 | cart | 2002-06-09 13:32:32 | 1 |
124 | pg2 | 2002-06-09 13:33:28 | 0 |
I am open to suggestions and alternatives too.
CodePudding user response:
A simple base R approach is to create a Vectorize
'd function of seq
and using grep
to identify the "from" rows (that contain "home") and "to" rows (that contain "cart"), then subsetting:
seqV <- Vectorize(seq.default,
vectorize.args = c("from", "to"))
keeps <- unlist(seqV(grep("home", df$Activity),
grep("cart", df$Activity)))
blocks <- df[keeps, ]
This will create a new data frame blocks
that will only contain the desired observations. However, if you wanted to make an index variable and keep the full dataset, you can add that directly in df
by:
df[keeps, "index"] <- 1
df[setdiff(1:nrow(df),keeps), "index"] <- 0
# sess_id Activity ts index
# 1 189 home 2002-06-09 12:45:40 1
# 2 189 pg1 2002-06-09 12:46:01 1
# 3 189 pg2 2002-06-09 12:46:30 1
# 4 189 cart 2002-06-09 12:47:00 1
# 5 189 pg3 2002-06-09 12:47:50 0
# 6 189 pg2 2002-06-09 12:49:51 0
# 7 189 home 2002-06-09 12:49:59 1
# 8 189 pg3 2002-06-09 13:00:00 1
# 9 189 cart 2002-06-09 13:30:00 1
# 10 189 pg1 2002-06-09 13:31:02 0
# 11 124 home 2002-06-09 13:31:45 1
# 12 124 pg2 2002-06-09 13:32:28 1
# 13 124 pg3 2002-06-09 13:32:30 1
# 14 124 cart 2002-06-09 13:32:32 1
# 15 124 pg2 2002-06-09 13:33:28 0
CodePudding user response:
Another way with sequence
:
w = which(df$Activity %in% c("home", "cart"))
idx = sequence(w[c(F, T)] - w[c(T, F)] 1, w[c(T, F)])
df$index <- (seq_len(nrow(df)) %in% idx)
output
sess_id Activity ts index
1 189 home 2002-06-09 12:45:40 1
2 189 pg1 2002-06-09 12:46:01 1
3 189 pg2 2002-06-09 12:46:30 1
4 189 cart 2002-06-09 12:47:00 1
5 189 pg3 2002-06-09 12:47:50 0
6 189 pg2 2002-06-09 12:49:51 0
7 189 home 2002-06-09 12:49:59 1
8 189 pg3 2002-06-09 13:00:00 1
9 189 cart 2002-06-09 13:30:00 1
10 189 pg1 2002-06-09 13:31:02 0
11 124 home 2002-06-09 13:31:45 1
12 124 pg2 2002-06-09 13:32:28 1
13 124 pg3 2002-06-09 13:32:30 1
14 124 cart 2002-06-09 13:32:32 1
15 124 pg2 2002-06-09 13:33:28 0
CodePudding user response:
Using purrr::accumulate
library(dplyr)
library(purrr)
df |>
group_by(sess_id) |>
mutate(block = accumulate(Activity,
~if(.y == "home")
1
else if (.y == "cart")
-1
else if (.x < 0)
0
else .x,
.init = 0) |>
tail(-1) |>
abs())
##> sess_id Activity ts block
##> <dbl> <chr> <chr> <dbl>
##> 1 189 home 2002-06-09 12:45:40 1
##> 2 189 pg1 2002-06-09 12:46:01 1
##> 3 189 pg2 2002-06-09 12:46:30 1
##> 4 189 cart 2002-06-09 12:47:00 1
##> 5 189 pg3 2002-06-09 12:47:50 0
##> 6 189 pg2 2002-06-09 12:49:51 0
##> 7 189 home 2002-06-09 12:49:59 1
##> 8 189 pg3 2002-06-09 13:00:00 1
##> 9 189 cart 2002-06-09 13:30:00 1
##> 10 189 pg1 2002-06-09 13:31:02 0
##> 11 124 home 2002-06-09 13:31:45 1
##> 12 124 pg2 2002-06-09 13:32:28 1
##> 13 124 pg3 2002-06-09 13:32:30 1
##> 14 124 cart 2002-06-09 13:32:32 1
##> 15 124 pg2 2002-06-09 13:33:28 0