I have a data frame like this:
id <- c("1", "1", "1", "1", "1", "1", "2", "2", "2", "3")
year <- c(1900,1900,1900,1901,1902,1902,1900,1901,1901,1900)
event <- c('A','B','C','A','D','E','A','B','C','A')
df <- data.frame(id, year, event)
df
id year event
1 1900 A
1 1900 B
1 1900 C
1 1901 A
1 1902 D
1 1902 E
2 1900 A
2 1901 B
2 1901 C
3 1900 A
Each event has a corresponding id and a year of occurrence. An event that occurred in the past can occur again in the same id. Each id experiences in a year a certain number of unique events. E.g., for id=1 in year=1900, this number is 3.
As output, I would like a (id,year)-combination, with the corresponding cumulative sum.
id year cum_sum
1 1900 3
1 1901 3
1 1902 5
2 1900 1
2 1901 3
3 1900 1
Thanks.
CodePudding user response:
Here is one option -
Save the event
values in a list
, using setdiff
find out how many new event
values are encountered in each year
. Use cumsum
to get cumulative sum for those id
's.
library(dplyr)
library(purrr)
df %>%
group_by(id, year) %>%
summarise(list_event = list(event)) %>%
mutate(cum_sum = map2_dbl(list_event, lag(list_event), ~length(setdiff(.x, .y))),
cum_sum = cumsum(cum_sum)) %>%
ungroup %>%
select(-list_event)
# id year cum_sum
# <chr> <dbl> <dbl>
#1 1 1900 3
#2 1 1901 3
#3 1 1902 5
#4 2 1900 1
#5 2 1901 3
#6 3 1900 1
CodePudding user response:
One dplyr
option could be:
df %>%
group_by(id) %>%
mutate(cum_sum = cumsum(!duplicated(event))) %>%
group_by(id, year) %>%
summarise(cum_sum = max(cum_sum))
id year cum_sum
<chr> <dbl> <int>
1 1 1900 3
2 1 1901 3
3 1 1902 5
4 2 1900 1
5 2 1901 3
6 3 1900 1