Home > database >  Sum previous 3 and 5 observations by group, ID and date in R
Sum previous 3 and 5 observations by group, ID and date in R

Time:11-15

I have a very large database that looks like this. For cntext, the data appartains to different companies with their related CEOs (ID) and the different years each CEO was in charge

ID <- c(1,1,1,1,1,1,3,3,3,5,5,4,4,4,4,4,4,4)
C <- c('a','a','a','a','a','a','b','b','b','b','b','c','c','c','c','c','c','c')
fyear <- c(2000, 2001, 2002,2003,2004,2005,2000, 2001,2002,2003,2004,2000, 2001, 2002,2003,2004,2005,2006)
data <- c(30,50,22,3,6,11,5,3,7,6,9,31,5,6,7,44,33,2)
df1 <- data.frame(ID,C,fyear, data)

ID  C   fyear  data
1   a   2000    30  
1   a   2001    50  
1   a   2002    22  
1   a   2003    3   
1   a   2004    6   
1   a   2005    11  
3   b   2000    5   
3   b   2001    3   
3   b   2002    7   
5   b   2003    6   
5   b   2004    9   
4   c   2000    31  
4   c   2001    5   
4   c   2002    6   
4   c   2003    7   
4   c   2004    44  
4   c   2005    33  
4   c   2006    2       

I need to build a code that allows me to sum up the previous 5 and 3 data related to each ID for every year. So t-3 and t-5 for each year. The result is something like this.

ID  C   fyear  data data3data5
1   a   2000    30  NA  NA
1   a   2001    50  NA  NA
1   a   2002    22  102 NA
1   a   2003    3   75  NA
1   a   2004    6   31  111
1   a   2005    11  20  86
3   b   2000    5   NA  NA
3   b   2001    3   NA  NA
3   b   2002    7   15  NA
5   b   2003    6   NA  NA
5   b   2004    9   NA  NA
4   c   2000    31  NA  NA
4   c   2001    5   NA  NA
4   c   2002    6   42  NA
4   c   2003    7   18  NA
4   c   2004    44  57  93
4   c   2005    33  84  95
4   c   2006    2   79  92

I have different columns of data for which I need to perform this operation, so if somebody also knows how I can do that and create a data3 and data5 column also for the other columns of data that I have that would be amazing. But even just being able to do the summation that I need is great! Thanks a lot. I hav looked around but don't seem to find any similar cses that satisfy my need

CodePudding user response:

To solve your specific question, this is a tidyverse solution:

df1 %>% 
  arrange(C, ID, fyear) %>% 
  group_by(C, ID) %>% 
  mutate(
    fyear3=rowSums(list(sapply(1:3, function(x) lag(data, x)))[[1]]),
    fyear5=rowSums(list(sapply(1:5, function(x) lag(data, x)))[[1]])
  ) %>%
  ungroup()
# A tibble: 18 × 6
      ID C     fyear  data fyear3 fyear5
   <dbl> <chr> <dbl> <dbl>  <dbl>  <dbl>
 1     1 a      2000    30     NA     NA
 2     1 a      2001    50     NA     NA
 3     1 a      2002    22     NA     NA
 4     1 a      2003     3    102     NA
 5     1 a      2004     6     75     NA
 6     1 a      2005    11     31    111
 7     3 b      2000     5     NA     NA
 8     3 b      2001     3     NA     NA
 9     3 b      2002     7     NA     NA
10     5 b      2003     6     NA     NA
11     5 b      2004     9     NA     NA
12     4 c      2000    31     NA     NA
13     4 c      2001     5     NA     NA
14     4 c      2002     6     NA     NA
15     4 c      2003     7     42     NA
16     4 c      2004    44     18     NA
17     4 c      2005    33     57     93
18     4 c      2006     2     84     95

The first mutate is a little hairy, so lets break one of the assignments down...

Find the nth lagged values of the data column, for n=1, 2 and 3.

sapply(1:3, function(x) lag(data, x))

Changes in CEO and Company are handled by the group_by() earlier in the pipe.

Create a list of these lagged values.

list(sapply(1:3, function(x) lag(data, x)))[[1]]

Row by row, calculate the sums of the lagged values

fyear3=rowSums(list(sapply(1:3, function(x) lag(data, x)))[[1]])

Now generalise the problem. Write a function takes as its inputs a dataset (so it works in a pipe), the new column, the column containing the values for which a lagged sum is required, and an integer defining the maximum lag.

lagSum <- function(data, newCol, valueCol, maxLag) {
  data %>% 
    mutate(
      {{newCol}} := rowSums(
                      list(
                        sapply(
                          1:maxLag, 
                          function(x) lag({{valueCol}}, x)
                        )
                      )[[1]]
                    )
    ) %>% 
    ungroup()
}

The embracing ({{ and }}) and use of := is required to handle tidyverse's non-standard evaluation (NSE).

Now use the function.

df1 %>% 
  arrange(C, ID, fyear) %>% 
  group_by(C, ID) %>% 
  lagSum(sumFYear3, data, 3) %>% 
  lagSum(sumFYear5, data, 5)
# A tibble: 18 × 6
      ID C     fyear  data sumFYear3 sumFYear5
   <dbl> <chr> <dbl> <dbl>     <dbl>     <dbl>
 1     1 a      2000    30        NA        NA
 2     1 a      2001    50        NA        NA
 3     1 a      2002    22        NA        NA
 4     1 a      2003     3       102        NA
 5     1 a      2004     6        75        NA
 6     1 a      2005    11        31       111
 7     3 b      2000     5        NA        92
 8     3 b      2001     3        NA        47
 9     3 b      2002     7        NA        28
10     5 b      2003     6        NA        32
11     5 b      2004     9        NA        32
12     4 c      2000    31        NA        30
13     4 c      2001     5        NA        56
14     4 c      2002     6        NA        58
15     4 c      2003     7        42        57
16     4 c      2004    44        18        58
17     4 c      2005    33        57        93
18     4 c      2006     2        84        95

EDIT

I misunderstood what you meant by "lag" and didn't read your description properly. My apologies.

I think your 86 in row 6 of your data5 column should be 92. if not, please explain why not.

Getting the answers you want should be a simple matter of adapting the function I wrote. For example:

lagSum <- function(data, newCol, valueCol, maxLag) {
  data %>% 
    mutate(
      {{newCol}} := {{valueCol}}   rowSums(
        list(
          sapply(
            1:maxLag, 
            function(x) lag({{valueCol}}, x)
          )
        )[[1]]
      )
    ) %>% 
    mutate() %>%
    ungroup() 
}

Gives

df1 %>% 
  arrange(C, ID, fyear) %>% 
  group_by(C, ID) %>% 
  lagSum(sumFYear3, value, 2)
# A tibble: 18 × 5
      ID C     fyear value sumFYear3
   <dbl> <chr> <dbl> <dbl>     <dbl>
 1     1 a      2000    30        NA
 2     1 a      2001    50        NA
 3     1 a      2002    22       102
 4     1 a      2003     3        75
 5     1 a      2004     6        31
 6     1 a      2005    11        20
 7     3 b      2000     5        NA
 8     3 b      2001     3        NA
 9     3 b      2002     7        15
10     5 b      2003     6        NA
11     5 b      2004     9        NA
12     4 c      2000    31        NA
13     4 c      2001     5        NA
14     4 c      2002     6        42
15     4 c      2003     7        18
16     4 c      2004    44        57
17     4 c      2005    33        84
18     4 c      2006     2        79

and

df1 %>% 
  arrange(C, ID, fyear) %>% 
  group_by(C, ID) %>% 
  lagSum(sumFYear5, value, 4)
# A tibble: 18 × 5
      ID C     fyear value sumFYear5
   <dbl> <chr> <dbl> <dbl>     <dbl>
 1     1 a      2000    30        NA
 2     1 a      2001    50        NA
 3     1 a      2002    22        NA
 4     1 a      2003     3        NA
 5     1 a      2004     6       111
 6     1 a      2005    11        92
 7     3 b      2000     5        NA
 8     3 b      2001     3        NA
 9     3 b      2002     7        NA
10     5 b      2003     6        NA
11     5 b      2004     9        NA
12     4 c      2000    31        NA
13     4 c      2001     5        NA
14     4 c      2002     6        NA
15     4 c      2003     7        NA
16     4 c      2004    44        93
17     4 c      2005    33        95
18     4 c      2006     2        92

as expected, but

df1 %>% 
  arrange(C, ID, fyear) %>% 
  group_by(C, ID) %>% 
  lagSum(sumFYear3, value, 2) %>% 
  lagSum(sumFYear5, value, 4)
# A tibble: 18 × 6
      ID C     fyear value sumFYear3 sumFYear5
   <dbl> <chr> <dbl> <dbl>     <dbl>     <dbl>
 1     1 a      2000    30        NA        NA
 2     1 a      2001    50        NA        NA
 3     1 a      2002    22       102        NA
 4     1 a      2003     3        75        NA
 5     1 a      2004     6        31       111
 6     1 a      2005    11        20        92
 7     3 b      2000     5        NA        47
 8     3 b      2001     3        NA        28
 9     3 b      2002     7        15        32
10     5 b      2003     6        NA        32
11     5 b      2004     9        NA        30
12     4 c      2000    31        NA        56
13     4 c      2001     5        NA        58
14     4 c      2002     6        42        57
15     4 c      2003     7        18        58
16     4 c      2004    44        57        93
17     4 c      2005    33        84        95
18     4 c      2006     2        79        92

Not as expected. At the moment, I cannot explain why. I managed to get the correct answers for both 3 and 5 year lags in the same pipe with:

df1 %>% 
  arrange(C, ID, fyear) %>% 
  group_by(C, ID) %>% 
  lagSum(sumFYear3, value, 2) %>% 
  left_join(
    df1 %>% 
      arrange(C, ID, fyear) %>% 
      group_by(C, ID) %>% 
      lagSum(sumFYear5, value, 4)
  )

But that shouldn't be necessary. I will think about this some more and may post a question of my own if I can't find an explanation.

Alternatively, this question gives a solution using the zoo package.

CodePudding user response:

We can use frollsum within data.table

library(data.table)

d <- 2:5

setDT(df1)[
  ,
  c(paste0("data", d)) := lapply(d, frollsum, x = data),
  .(ID, C)
]

which yields

> df1
    ID C fyear data data2 data3 data4 data5
 1:  1 a  2000   30    NA    NA    NA    NA
 2:  1 a  2001   50    80    NA    NA    NA
 3:  1 a  2002   22    72   102    NA    NA
 4:  1 a  2003    3    25    75   105    NA
 5:  1 a  2004    6     9    31    81   111
 6:  1 a  2005   11    17    20    42    92
 7:  3 b  2000    5    NA    NA    NA    NA
 8:  3 b  2001    3     8    NA    NA    NA
 9:  3 b  2002    7    10    15    NA    NA
10:  5 b  2003    6    NA    NA    NA    NA
11:  5 b  2004    9    15    NA    NA    NA
12:  4 c  2000   31    NA    NA    NA    NA
13:  4 c  2001    5    36    NA    NA    NA
14:  4 c  2002    6    11    42    NA    NA
15:  4 c  2003    7    13    18    49    NA
16:  4 c  2004   44    51    57    62    93
17:  4 c  2005   33    77    84    90    95
18:  4 c  2006    2    35    79    86    92
  • Related