Identifying the start and end of a sequence in R-CodePudding

The following is a sample dataset:

df <- structure(list(
sess_id = c(189, 189, 189, 189, 189, 189, 189, 189, 189, 189, 124, 124,124,124,124),
Activity = c("home", "pg1", "pg2", "cart", "pg3", "pg2", "home", "pg3", "cart","pg1","home","pg2", "pg3", "cart", "pg2"),
ts = c("2002-06-09 12:45:40","2002-06-09 12:46:01","2002-06-09 12:46:30","2002-06-09 12:47:00","2002-06-09 12:47:50", "2002-06-09 12:49:51", "2002-06-09 12:49:59", "2002-06-09 13:00:00", "2002-06-09 13:30:00", "2002-06-09 13:31:02", "2002-06-09 13:31:45", "2002-06-09 13:32:28", "2002-06-09 13:32:30", "2002-06-09 13:32:32", "2002-06-09 13:33:28")),
.Names = c("sess_id", "Activity", "ts"),
row.names = c(NA, -15L),
class = "data.frame")

I am trying to analyze the sequence in blocks. I want to extract a block of user behavior where the start point is 'home' and the end point is 'cart'. For instance, for sess_id 189, I would have two blocks: home, pg1, pg2, cart and home, pg3, cart.

My idea is to create a new column called index where a 1 denotes part of a block. If 0, I can filter out those rows as not being part of the home to cart sequence.

With the following code I am able to mark the start and end points, but not the Activity between home and cart.

df %>%
  group_by(sess_id) %>%
  arrange(ts) %>% 
  mutate(index = case_when(Activity == "home" | Activity == "cart" ~ 1, TRUE ~ 0)) %>% 
  mutate(index = as.numeric(index)) %>%
  ungroup()%>%
  mutate(block_index = cumsum(index))

I am expecting an output something like this:

sess_id	Activity	ts	index
189	home	2002-06-09 12:45:40	1
189	pg1	2002-06-09 12:46:01	1
189	pg2	2002-06-09 12:46:30	1
189	cart	2002-06-09 12:47:00	1
189	pg3	2002-06-09 12:47:50	0
189	pg2	2002-06-09 12:49:51	0
189	home	2002-06-09 12:49:59	1
189	pg3	2002-06-09 13:00:00	1
189	cart	2002-06-09 13:30:00	1
189	pg1	2002-06-09 13:31:02	0
124	home	2002-06-09 13:31:45	1
124	pg2	2002-06-09 13:32:28	1
124	pg3	2002-06-09 13:32:30	1
124	cart	2002-06-09 13:32:32	1
124	pg2	2002-06-09 13:33:28	0

I am open to suggestions and alternatives too.

CodePudding user response：

A simple base R approach is to create a Vectorize'd function of seq and using grep to identify the "from" rows (that contain "home") and "to" rows (that contain "cart"), then subsetting:

seqV <- Vectorize(seq.default, 
                  vectorize.args = c("from", "to"))

keeps <- unlist(seqV(grep("home", df$Activity), 
                     grep("cart", df$Activity)))

blocks <- df[keeps, ]

This will create a new data frame blocks that will only contain the desired observations. However, if you wanted to make an index variable and keep the full dataset, you can add that directly in df by:

df[keeps, "index"] <- 1
df[setdiff(1:nrow(df),keeps), "index"] <- 0

#    sess_id Activity                  ts index
# 1      189     home 2002-06-09 12:45:40     1
# 2      189      pg1 2002-06-09 12:46:01     1
# 3      189      pg2 2002-06-09 12:46:30     1
# 4      189     cart 2002-06-09 12:47:00     1
# 5      189      pg3 2002-06-09 12:47:50     0
# 6      189      pg2 2002-06-09 12:49:51     0
# 7      189     home 2002-06-09 12:49:59     1
# 8      189      pg3 2002-06-09 13:00:00     1
# 9      189     cart 2002-06-09 13:30:00     1
# 10     189      pg1 2002-06-09 13:31:02     0
# 11     124     home 2002-06-09 13:31:45     1
# 12     124      pg2 2002-06-09 13:32:28     1
# 13     124      pg3 2002-06-09 13:32:30     1
# 14     124     cart 2002-06-09 13:32:32     1
# 15     124      pg2 2002-06-09 13:33:28     0

CodePudding user response：

Another way with sequence:

w = which(df$Activity %in% c("home", "cart"))
idx = sequence(w[c(F, T)] - w[c(T, F)]   1, w[c(T, F)])
df$index <-  (seq_len(nrow(df)) %in% idx)

output

   sess_id Activity                  ts index
1      189     home 2002-06-09 12:45:40     1
2      189      pg1 2002-06-09 12:46:01     1
3      189      pg2 2002-06-09 12:46:30     1
4      189     cart 2002-06-09 12:47:00     1
5      189      pg3 2002-06-09 12:47:50     0
6      189      pg2 2002-06-09 12:49:51     0
7      189     home 2002-06-09 12:49:59     1
8      189      pg3 2002-06-09 13:00:00     1
9      189     cart 2002-06-09 13:30:00     1
10     189      pg1 2002-06-09 13:31:02     0
11     124     home 2002-06-09 13:31:45     1
12     124      pg2 2002-06-09 13:32:28     1
13     124      pg3 2002-06-09 13:32:30     1
14     124     cart 2002-06-09 13:32:32     1
15     124      pg2 2002-06-09 13:33:28     0

CodePudding user response：

Using purrr::accumulate

library(dplyr)
library(purrr)

df |>
    group_by(sess_id) |>
    mutate(block = accumulate(Activity,
                              ~if(.y == "home")
                                   1
                               else if (.y == "cart")
                                   -1
                               else if (.x < 0)
                                   0
                               else .x,
                              .init = 0) |>
               tail(-1) |>
               abs())

##>    sess_id Activity ts                  block
##>      <dbl> <chr>    <chr>               <dbl>
##>  1     189 home     2002-06-09 12:45:40     1
##>  2     189 pg1      2002-06-09 12:46:01     1
##>  3     189 pg2      2002-06-09 12:46:30     1
##>  4     189 cart     2002-06-09 12:47:00     1
##>  5     189 pg3      2002-06-09 12:47:50     0
##>  6     189 pg2      2002-06-09 12:49:51     0
##>  7     189 home     2002-06-09 12:49:59     1
##>  8     189 pg3      2002-06-09 13:00:00     1
##>  9     189 cart     2002-06-09 13:30:00     1
##> 10     189 pg1      2002-06-09 13:31:02     0
##> 11     124 home     2002-06-09 13:31:45     1
##> 12     124 pg2      2002-06-09 13:32:28     1
##> 13     124 pg3      2002-06-09 13:32:30     1
##> 14     124 cart     2002-06-09 13:32:32     1
##> 15     124 pg2      2002-06-09 13:33:28     0