Home > database >  Cumulative sum of unique events for each year
Cumulative sum of unique events for each year

Time:10-14

I have a data frame like this:

id <- c("1", "1", "1", "1", "1", "1", "2", "2", "2", "3")
year <- c(1900,1900,1900,1901,1902,1902,1900,1901,1901,1900)
event <- c('A','B','C','A','D','E','A','B','C','A')
df <- data.frame(id, year, event)
df

id   year   event
1    1900   A
1    1900   B
1    1900   C
1    1901   A
1    1902   D
1    1902   E
2    1900   A
2    1901   B
2    1901   C
3    1900   A

Each event has a corresponding id and a year of occurrence. An event that occurred in the past can occur again in the same id. Each id experiences in a year a certain number of unique events. E.g., for id=1 in year=1900, this number is 3.

As output, I would like a (id,year)-combination, with the corresponding cumulative sum.

id   year  cum_sum
1    1900  3
1    1901  3
1    1902  5
2    1900  1
2    1901  3
3    1900  1

Thanks.

CodePudding user response:

Here is one option -

Save the event values in a list, using setdiff find out how many new event values are encountered in each year. Use cumsum to get cumulative sum for those id's.

library(dplyr)
library(purrr)

df %>%
  group_by(id, year) %>%
  summarise(list_event = list(event)) %>%
  mutate(cum_sum = map2_dbl(list_event, lag(list_event), ~length(setdiff(.x, .y))), 
         cum_sum = cumsum(cum_sum)) %>%
  ungroup %>%
  select(-list_event)

#  id     year cum_sum
#  <chr> <dbl>   <dbl>
#1 1      1900       3
#2 1      1901       3
#3 1      1902       5
#4 2      1900       1
#5 2      1901       3
#6 3      1900       1

CodePudding user response:

One dplyr option could be:

df %>%
    group_by(id) %>%
    mutate(cum_sum = cumsum(!duplicated(event))) %>%
    group_by(id, year) %>%
    summarise(cum_sum = max(cum_sum))

  id     year cum_sum
  <chr> <dbl>   <int>
1 1      1900       3
2 1      1901       3
3 1      1902       5
4 2      1900       1
5 2      1901       3
6 3      1900       1
  •  Tags:  
  • r
  • Related