Conditional cumulative sum in R with dplyr. All dates before current date

Time:11-07

I am looking for a way to compute a cumulative sum in R that excludes everything on the current date.

I have the following data frame (which is a subset and simplified version of the real data frame):

df <- structure(list(date_time = structure(c(1609513200, 1609513200, 1609513200,
  1609516800, 1609516800, 1609516800, 1609599600, 1609599600, 1609599600, 
  1609603200, 1609603200, 1609603200), tzone = "UTC", class = c("POSIXct", 
  "POSIXt")), event = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L), 
  person = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"), 
  did_attend = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L), 
  events_attended = c(0, 0, 0, 1, 1, 1, 2, 2, 1, 2, 3, 2), 
  events_attended_desired = c(0L, 0L, 0L, 0L, 0L, 0L, 2L, 2L, 1L, 2L, 2L, 1L)), 
  class = c("grouped_df", "tbl_df", "tbl", "data.frame"), 
  row.names = c(NA, -12L), groups = structure(list(person = c("A", "B", "C"),
  .rows = structure(list(c(1L, 4L, 7L, 10L), c(2L, 5L, 8L, 11L), 
  c(3L, 6L, 9L, 12L)), ptype = integer(0), 
  class = c("vctrs_list_of", "vctrs_vctr", "list"))), 
  class = c("tbl_df", "tbl", "data.frame"), 
  row.names = c(NA, -3L), .drop = TRUE))
 
 df
 ## date_time           event person did_attend events_attended events_attended_desired
 ## 2021-01-01 15:00:00     1 A               1               0                       0
 ## 2021-01-01 15:00:00     1 B               1               0                       0
 ## 2021-01-01 15:00:00     1 C               1               0                       0
 ## 2021-01-01 16:00:00     2 A               1               1                       0
 ## 2021-01-01 16:00:00     2 B               1               1                       0
 ## 2021-01-01 16:00:00     2 C               0               1                       0
 ## 2021-01-02 15:00:00     1 A               0               2                       2
 ## 2021-01-02 15:00:00     1 B               1               2                       2
 ## 2021-01-02 15:00:00     1 C               1               1                       1
 ## 2021-01-02 16:00:00     2 A               1               2                       2
 ## 2021-01-02 16:00:00     2 B               0               3                       2
 ## 2021-01-02 16:00:00     2 C               1               2                       1

The "did_attend" column is a dummy variable indicating whether a person attended the event. The "events_attended" column was produced by

df <- df %>% 
  arrange(date_time) %>% 
  group_by(person) %>% 
  mutate(events_attended = lag(cumsum(did_attend), default = 0)) %>% 
  ungroup()

Now I am looking for a way to exclude the events of the current date, so that the cumulative sum only covers dates before the current date (the desired output is in the events_attended_desired column). There are several events each day, and the number of events differs from day to day, so a lag-based version does not work. I tried several ifelse() calls inside cumsum(), but they didn't work either, because I don't know how to compare the dates in an ifelse() clause within cumsum().

CodePudding user response:

Here's an approach using dplyr and lubridate::floor_date.

First, I add a "date" column to the data frame, so that I can summarize and join based on the date.

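Then I join this table to a summarized version of itself. count(date, wt = did_attend) is a shortcut for group_by(date) %>% summarize(n = sum(did_attend)); because df is already grouped by person, the totals are per person and date. Taking the cumulative sum of the lagged totals then gives the attendance on all earlier dates, which is the desired result.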

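library(dplyr)

# Add a calendar-day column so attendance can be summarized per day.
df2 <- df %>%
  mutate(date = lubridate::floor_date(date_time, "day"))

df2 %>%
  left_join(
    df2 %>% 
      # df2 is still grouped by person, so this counts per person and date.
      count(date, wt = did_attend) %>%
      # Running total of attendance on strictly earlier dates.
      mutate(prior_attended = cumsum(lag(n, default = 0))) %>%
      select(-n),
    by = c("person", "date")
  )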
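
Note that this works per person only because df arrives already grouped by person. As a rough sketch, assuming an ungrouped copy of the data, the same idea with the grouping and the join keys spelled out might look like the following. The names events, daily, attended_today and result are my own, and the tibble is just the relevant columns of the question's data retyped by hand.

```r
library(dplyr)

# Ungrouped example in the shape of the question's data
# (values copied from the question; only the needed columns are kept).
events <- tibble(
  date_time = as.POSIXct(
    rep(c("2021-01-01 15:00:00", "2021-01-01 16:00:00",
          "2021-01-02 15:00:00", "2021-01-02 16:00:00"), each = 3),
    tz = "UTC"
  ),
  person     = rep(c("A", "B", "C"), times = 4),
  did_attend = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L)
)

# One row per person and calendar day, holding attendance on strictly earlier days.
daily <- events %>%
  mutate(date = as.Date(date_time)) %>%
  group_by(person, date) %>%
  summarise(attended_today = sum(did_attend), .groups = "drop_last") %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(prior_attended = cumsum(lag(attended_today, default = 0))) %>%
  ungroup() %>%
  select(-attended_today)

# Attach the per-day running total back onto every event row.
result <- events %>%
  mutate(date = as.Date(date_time)) %>%
  left_join(daily, by = c("person", "date"))
```

Here result$prior_attended should line up with the events_attended_desired column from the question.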

CodePudding user response:

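For each row, multiply every did_attend value by 1 if its date is before that row's date and by 0 otherwise, then sum the products. Because df is grouped by person, the sum is taken within each person.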

 library(dplyr)
 df %>% 
   # For each row's date x, sum did_attend over rows with a strictly earlier date.
   mutate(events_attended = sapply(as.Date(date_time), 
      function(x) sum((as.Date(date_time) < x) * did_attend))) %>%
   arrange(date_time) %>%
   ungroup()
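
To see what the multiplication does, here is a small sketch in base R for person A only, using the values from the question (the variable names dates and prior_attended are mine):

```r
# Person A's four events, taken from the question's data.
dates      <- as.Date(c("2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"))
did_attend <- c(1L, 1L, 0L, 1L)

# For each event date x, (dates < x) is TRUE only for strictly earlier days;
# multiplying by did_attend zeroes out same-day and later events.
prior_attended <- sapply(dates, function(x) sum((dates < x) * did_attend))

prior_attended
# 0 0 2 2
```

Each row is compared against every other row in its group, so the work grows quadratically with the number of events per person. That is fine for small data but may be slow on large groups.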