Home > database >  Conditional cumulative sum associated with the date in R
Conditional cumulative sum associated with the date in R

Time:10-31

I want to compute cumulative sum within 5 days for each group.

df <- data.frame(
  date = ymd( c( "2022-01-02","2022-01-03","2022-01-05","2022-01-07","2022-01-11","2022-01-14","2022-01-17","2022-01-18","2022-01-24","2022-01-27","2022-01-01","2022-01-04","2022-01-04","2022-01-08","2022-01-12","2022-01-14","2022-01-19","2022-01-24","2022-01-25","2022-01-28")),
  group = c("A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B"),
  number = c(10,30,20,50,30,50,40,50,30,50,55,10,30,20,50,30,40,30,40,30))

A small sample of my data frame is below including what the cumulative sum column should return. Any help would be appreciated. Thanks.

date       group number cumsum(s)
2022-01-02 A     10     10
2022-01-03 A     30     40
2022-01-05 A     20     60
2022-01-07 A     50     110
2022-01-11 A     30     80
2022-01-14 A     50     80
2022-01-17 A     40     90
2022-01-18 A     50     140
2022-01-24 A     30     30
2022-01-27 A     50     80
2022-01-01 B     55     55
2022-01-04 B     10     65
2022-01-04 B     30     95
2022-01-08 B     20     60
2022-01-12 B     50     70
2022-01-14 B     30     80
2022-01-19 B     40     70
2022-01-24 B     30     70
2022-01-25 B     40     70
2022-01-28 B     30     100

I tried to use map() and cumsum() but failed.

CodePudding user response:

This gives the expected output:

library(tidyverse)

df %>%
  group_by(group) %>%
  mutate(cumsum = sapply(date, \(x) sum(number[date >= (x - 5) & date <= x])))
#> # A tibble: 20 x 4
#> # Groups:   group [2]
#>    date       group number cumsum
#>    <date>     <chr>  <dbl>  <dbl>
#>  1 2022-01-02 A         10     10
#>  2 2022-01-03 A         30     40
#>  3 2022-01-05 A         20     60
#>  4 2022-01-07 A         50    110
#>  5 2022-01-11 A         30     80
#>  6 2022-01-14 A         50     80
#>  7 2022-01-17 A         40     90
#>  8 2022-01-18 A         50    140
#>  9 2022-01-24 A         30     30
#> 10 2022-01-27 A         50     80
#> 11 2022-01-01 B         55     55
#> 12 2022-01-04 B         10     95
#> 13 2022-01-04 B         30     95
#> 14 2022-01-08 B         20     60
#> 15 2022-01-12 B         50     70
#> 16 2022-01-14 B         30     80
#> 17 2022-01-19 B         40     70
#> 18 2022-01-24 B         30     70
#> 19 2022-01-25 B         40     70
#> 20 2022-01-28 B         30    100

Created on 2022-10-30 with reprex v2.0.2

CodePudding user response:

You can join the data on itself, using a non-equi join in data.table, and using .EACHI to estimate the cumulative sum

library(data.table)

df = setDT(df)[, d:=date-5][]

cbind(
  df[df,on=.(group,date<=date, date>=d), .(result = sum(number)), .EACHI][, .(group,date,result)],
  df[, .(number)]
)

Output:

    group       date result number
 1:     A 2022-01-02     10     10
 2:     A 2022-01-03     40     30
 3:     A 2022-01-05     60     20
 4:     A 2022-01-07    110     50
 5:     A 2022-01-11     80     30
 6:     A 2022-01-14     80     50
 7:     A 2022-01-17     90     40
 8:     A 2022-01-18    140     50
 9:     A 2022-01-24     30     30
10:     A 2022-01-27     80     50
11:     B 2022-01-01     55     55
12:     B 2022-01-04     95     10
13:     B 2022-01-04     95     30
14:     B 2022-01-08     60     20
15:     B 2022-01-12     70     50
16:     B 2022-01-14     80     30
17:     B 2022-01-19     70     40
18:     B 2022-01-24     70     30
19:     B 2022-01-25     70     40
20:     B 2022-01-28    100     30

CodePudding user response:

We could get the difference of date to create a logical vector, get the cumulative sum and use that in grouping and get the cumsum of 'number'

library(dplyr)
df %>%
   group_by(group) %>%
   mutate(difgrp =cumsum(c(TRUE, diff(date) >=4))) %>% 
   group_by(diffgrp, .add = TRUE) %>% 
   mutate(s = cumsum(number)) %>% 
   ungroup %>% 
   select(-diffgrp)
  • Related