I want to compute cumulative sum within 5 days for each group.
df <- data.frame(
date = ymd( c( "2022-01-02","2022-01-03","2022-01-05","2022-01-07","2022-01-11","2022-01-14","2022-01-17","2022-01-18","2022-01-24","2022-01-27","2022-01-01","2022-01-04","2022-01-04","2022-01-08","2022-01-12","2022-01-14","2022-01-19","2022-01-24","2022-01-25","2022-01-28")),
group = c("A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B"),
number = c(10,30,20,50,30,50,40,50,30,50,55,10,30,20,50,30,40,30,40,30))
A small sample of my data frame is below including what the cumulative sum column should return. Any help would be appreciated. Thanks.
date group number cumsum(s)
2022-01-02 A 10 10
2022-01-03 A 30 40
2022-01-05 A 20 60
2022-01-07 A 50 110
2022-01-11 A 30 80
2022-01-14 A 50 80
2022-01-17 A 40 90
2022-01-18 A 50 140
2022-01-24 A 30 30
2022-01-27 A 50 80
2022-01-01 B 55 55
2022-01-04 B 10 65
2022-01-04 B 30 95
2022-01-08 B 20 60
2022-01-12 B 50 70
2022-01-14 B 30 80
2022-01-19 B 40 70
2022-01-24 B 30 70
2022-01-25 B 40 70
2022-01-28 B 30 100
I tried to use map() and cumsum() but failed.
CodePudding user response:
This gives the expected output:
library(tidyverse)
df %>%
group_by(group) %>%
mutate(cumsum = sapply(date, \(x) sum(number[date >= (x - 5) & date <= x])))
#> # A tibble: 20 x 4
#> # Groups: group [2]
#> date group number cumsum
#> <date> <chr> <dbl> <dbl>
#> 1 2022-01-02 A 10 10
#> 2 2022-01-03 A 30 40
#> 3 2022-01-05 A 20 60
#> 4 2022-01-07 A 50 110
#> 5 2022-01-11 A 30 80
#> 6 2022-01-14 A 50 80
#> 7 2022-01-17 A 40 90
#> 8 2022-01-18 A 50 140
#> 9 2022-01-24 A 30 30
#> 10 2022-01-27 A 50 80
#> 11 2022-01-01 B 55 55
#> 12 2022-01-04 B 10 95
#> 13 2022-01-04 B 30 95
#> 14 2022-01-08 B 20 60
#> 15 2022-01-12 B 50 70
#> 16 2022-01-14 B 30 80
#> 17 2022-01-19 B 40 70
#> 18 2022-01-24 B 30 70
#> 19 2022-01-25 B 40 70
#> 20 2022-01-28 B 30 100
Created on 2022-10-30 with reprex v2.0.2
CodePudding user response:
You can join the data on itself, using a non-equi join in data.table
, and using .EACHI
to estimate the cumulative sum
library(data.table)
df = setDT(df)[, d:=date-5][]
cbind(
df[df,on=.(group,date<=date, date>=d), .(result = sum(number)), .EACHI][, .(group,date,result)],
df[, .(number)]
)
Output:
group date result number
1: A 2022-01-02 10 10
2: A 2022-01-03 40 30
3: A 2022-01-05 60 20
4: A 2022-01-07 110 50
5: A 2022-01-11 80 30
6: A 2022-01-14 80 50
7: A 2022-01-17 90 40
8: A 2022-01-18 140 50
9: A 2022-01-24 30 30
10: A 2022-01-27 80 50
11: B 2022-01-01 55 55
12: B 2022-01-04 95 10
13: B 2022-01-04 95 30
14: B 2022-01-08 60 20
15: B 2022-01-12 70 50
16: B 2022-01-14 80 30
17: B 2022-01-19 70 40
18: B 2022-01-24 70 30
19: B 2022-01-25 70 40
20: B 2022-01-28 100 30
CodePudding user response:
We could get the diff
erence of date
to create a logical vector, get the cumulative sum and use that in grouping and get the cumsum
of 'number'
library(dplyr)
df %>%
group_by(group) %>%
mutate(difgrp =cumsum(c(TRUE, diff(date) >=4))) %>%
group_by(diffgrp, .add = TRUE) %>%
mutate(s = cumsum(number)) %>%
ungroup %>%
select(-diffgrp)