Counting matching values within past 30 seconds by group-CodePudding

I am working with a df similar to this:

df <- data.frame(p1 = c("Tom", "Tom", "Tom", "Tom", "Tom", "Tom", "Brad", "Brad", "Brad", "Brad", "Brad", "Brad", "Brad","Brad", "Brad", "Brad" ),
              elapsed_time = c(0, 10, 17, 28, 47, 65, 83, 100, 135, 180, 210, 225, 237, 253, 276, 281),
              event_type = c("start of period", "play", "play", "timeout", "play", "play", "play", "play", "play", "timeout", "play", "play", "play", "play",
                             "play", "play"),
              scored = c(NA, NA, "Tom", NA, NA, "Tom", NA, NA, NA, NA, NA, "Brad", NA, NA, "Brad", "Brad" ),
             timesincelastbreak = c(0, 10, 17, 0, 19, 37, 0, 17, 52, 0, 30, 45, 57, 73, 96, 101))

that outputs this:

 p1    elapsed_time event_type      scored timesincelastbreak
   <chr>        <dbl> <chr>           <chr>               <dbl>
 1 Tom              0 start of period NA                      0
 2 Tom             10 play            NA                     10
 3 Tom             17 play            Tom                    17
 4 Tom             28 timeout         NA                      0
 5 Tom             47 play            NA                     19
 6 Tom             65 play            Tom                    37
 7 Brad            83 play            NA                      0
 8 Brad           100 play            NA                     17
 9 Brad           135 play            NA                     52
10 Brad           180 timeout         NA                      0
11 Brad           210 play            NA                     30
12 Brad           225 play            Brad                   45
13 Brad           237 play            NA                     57
14 Brad           253 play            NA                     73
15 Brad           276 play            Brad                   96
16 Brad           281 play            Brad                  101

I'd like to add a column (timesscored) that adds together the number of times p1 == scored in the 30 seconds immediately preceding each row, based on timesincelastbreak.

Also, timesscored needs to reset every time that timesincelastbreak is equal to zero.

The desired output is:

p1    elapsed_time event_type      scored timesincelastbreak    timesscored   
   <chr>        <dbl> <chr>           <chr>               <dbl>      <dbl>     
 1 Tom              0 start of period NA                      0          0
 2 Tom             10 play            NA                     10          0
 3 Tom             17 play            Tom                    17          1
 4 Tom             28 timeout         NA                      0          0
 5 Tom             47 play            NA                     19          0
 6 Tom             65 play            Tom                    37          1
 7 Brad            83 play            NA                      0          0
 8 Brad           100 play            NA                     17          0
 9 Brad           135 play            NA                     52          0
10 Brad           180 timeout         NA                      0          0
11 Brad           210 play            NA                     30          0
12 Brad           225 play            Brad                   45          1
13 Brad           237 play            NA                     57          1
14 Brad           253 play            NA                     73          1
15 Brad           276 play            Brad                   96          1
16 Brad           281 play            Brad                  101          2

I've tried using cumsum(df$p1 == df$scored) but I don't know how to specify the interval of 30 for timesincelastbreak.

CodePudding user response：

Here’s a solution based on adding the past 30 seconds for each row using purrr::map_int().

library(dplyr)
library(purrr)

df %>%
  group_by(block = cumsum(timesincelastbreak == 0)) %>%
  mutate(timescored = map_int(
    timesincelastbreak, 
    \(cur_time) sum(
      between(timesincelastbreak, cur_time - 30, cur_time) & scored == p1, 
      na.rm = TRUE
    )
  )) %>%
  ungroup() %>%
  select(!block)

# A tibble: 16 × 6
   p1    elapsed_time event_type      scored timesincelastbreak timescored
   <chr>        <dbl> <chr>           <chr>               <dbl>      <int>
 1 Tom              0 start of period <NA>                    0          0
 2 Tom             10 play            <NA>                   10          0
 3 Tom             17 play            Tom                    17          1
 4 Tom             28 timeout         <NA>                    0          0
 5 Tom             47 play            <NA>                   19          0
 6 Tom             65 play            Tom                    37          1
 7 Brad            83 play            <NA>                    0          0
 8 Brad           100 play            <NA>                   17          0
 9 Brad           135 play            <NA>                   52          0
10 Brad           180 timeout         <NA>                    0          0
11 Brad           210 play            <NA>                   30          0
12 Brad           225 play            Brad                   45          1
13 Brad           237 play            <NA>                   57          1
14 Brad           253 play            <NA>                   73          1
15 Brad           276 play            Brad                   96          1
16 Brad           281 play            Brad                  101          2

CodePudding user response：

To be honest, I am not sure if I have understood the algorithm correctly.

Using dplyr and tidyr we first calculate the timesscored values:

library(tidyr)
library(dplyr)

df_score <- df %>% 
  group_by(grp = cumsum(timesincelastbreak == 0)) %>% # reset everything for timesincelastbreak == 0
  drop_na(scored) %>% 
  mutate(
    timesscored = ifelse(
      row_number() == 1, # first score is independent of timesincelastbreak
      1, 
      coalesce((p1 == scored), FALSE) * # check for p1 == scored
        (timesincelastbreak - lag(timesincelastbreak, default = 0) < 30) # check for < 30 seconds
      )
    ) %>% 
  ungroup() %>% 
  select(-grp)

This gives us

# A tibble: 5 × 6
  p1    elapsed_time event_type scored timesincelastbreak timesscored
  <chr>        <dbl> <chr>      <chr>               <dbl>       <dbl>
1 Tom             17 play       Tom                    17           1
2 Tom             65 play       Tom                    37           1
3 Brad           225 play       Brad                   45           1
4 Brad           276 play       Brad                   96           0
5 Brad           281 play       Brad                  101           1

Now we add those add back to the original data.frame and apply cumsum:

df %>% 
  left_join(df_score, 
            by = c("p1", "elapsed_time", "event_type", "scored", "timesincelastbreak")) %>%
  group_by(grp = cumsum(timesincelastbreak == 0)) %>% 
  mutate(timesscored = cumsum(coalesce(timesscored, 0))) %>% 
  ungroup() %>% 
  select(-grp)

This returns

# A tibble: 16 × 6
   p1    elapsed_time event_type      scored timesincelastbreak timesscored
   <chr>        <dbl> <chr>           <chr>               <dbl>       <dbl>
 1 Tom              0 start of period NA                      0           0
 2 Tom             10 play            NA                     10           0
 3 Tom             17 play            Tom                    17           1
 4 Tom             28 timeout         NA                      0           0
 5 Tom             47 play            NA                     19           0
 6 Tom             65 play            Tom                    37           1
 7 Brad            83 play            NA                      0           0
 8 Brad           100 play            NA                     17           0
 9 Brad           135 play            NA                     52           0
10 Brad           180 timeout         NA                      0           0
11 Brad           210 play            NA                     30           0
12 Brad           225 play            Brad                   45           1
13 Brad           237 play            NA                     57           1
14 Brad           253 play            NA                     73           1
15 Brad           276 play            Brad                   96           1
16 Brad           281 play            Brad                  101           2