I have a very large database that looks like this. For cntext, the data
appartains to different companies with their related CEOs (ID
) and the different years each CEO was in charge
ID <- c(1,1,1,1,1,1,3,3,3,5,5,4,4,4,4,4,4,4)
C <- c('a','a','a','a','a','a','b','b','b','b','b','c','c','c','c','c','c','c')
fyear <- c(2000, 2001, 2002,2003,2004,2005,2000, 2001,2002,2003,2004,2000, 2001, 2002,2003,2004,2005,2006)
data <- c(30,50,22,3,6,11,5,3,7,6,9,31,5,6,7,44,33,2)
df1 <- data.frame(ID,C,fyear, data)
ID C fyear data
1 a 2000 30
1 a 2001 50
1 a 2002 22
1 a 2003 3
1 a 2004 6
1 a 2005 11
3 b 2000 5
3 b 2001 3
3 b 2002 7
5 b 2003 6
5 b 2004 9
4 c 2000 31
4 c 2001 5
4 c 2002 6
4 c 2003 7
4 c 2004 44
4 c 2005 33
4 c 2006 2
I need to build a code that allows me to sum up the previous 5 and 3 data
related to each ID
for every year. So t-3 and t-5 for each year. The result is something like this.
ID C fyear data data3data5
1 a 2000 30 NA NA
1 a 2001 50 NA NA
1 a 2002 22 102 NA
1 a 2003 3 75 NA
1 a 2004 6 31 111
1 a 2005 11 20 86
3 b 2000 5 NA NA
3 b 2001 3 NA NA
3 b 2002 7 15 NA
5 b 2003 6 NA NA
5 b 2004 9 NA NA
4 c 2000 31 NA NA
4 c 2001 5 NA NA
4 c 2002 6 42 NA
4 c 2003 7 18 NA
4 c 2004 44 57 93
4 c 2005 33 84 95
4 c 2006 2 79 92
I have different columns of data for which I need to perform this operation, so if somebody also knows how I can do that and create a data3
and data5
column also for the other columns of data that I have that would be amazing. But even just being able to do the summation that I need is great! Thanks a lot.
I hav looked around but don't seem to find any similar cses that satisfy my need
CodePudding user response:
To solve your specific question, this is a tidyverse solution:
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
mutate(
fyear3=rowSums(list(sapply(1:3, function(x) lag(data, x)))[[1]]),
fyear5=rowSums(list(sapply(1:5, function(x) lag(data, x)))[[1]])
) %>%
ungroup()
# A tibble: 18 × 6
ID C fyear data fyear3 fyear5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA NA
2 1 a 2001 50 NA NA
3 1 a 2002 22 NA NA
4 1 a 2003 3 102 NA
5 1 a 2004 6 75 NA
6 1 a 2005 11 31 111
7 3 b 2000 5 NA NA
8 3 b 2001 3 NA NA
9 3 b 2002 7 NA NA
10 5 b 2003 6 NA NA
11 5 b 2004 9 NA NA
12 4 c 2000 31 NA NA
13 4 c 2001 5 NA NA
14 4 c 2002 6 NA NA
15 4 c 2003 7 42 NA
16 4 c 2004 44 18 NA
17 4 c 2005 33 57 93
18 4 c 2006 2 84 95
The first mutate is a little hairy, so lets break one of the assignments down...
Find the nth lagged values of the data
column, for n=1, 2 and 3.
sapply(1:3, function(x) lag(data, x))
Changes in CEO and Company are handled by the group_by()
earlier in the pipe.
Create a list of these lagged values.
list(sapply(1:3, function(x) lag(data, x)))[[1]]
Row by row, calculate the sums of the lagged values
fyear3=rowSums(list(sapply(1:3, function(x) lag(data, x)))[[1]])
Now generalise the problem. Write a function takes as its inputs a dataset (so it works in a pipe), the new column, the column containing the values for which a lagged sum is required, and an integer defining the maximum lag.
lagSum <- function(data, newCol, valueCol, maxLag) {
data %>%
mutate(
{{newCol}} := rowSums(
list(
sapply(
1:maxLag,
function(x) lag({{valueCol}}, x)
)
)[[1]]
)
) %>%
ungroup()
}
The embracing ({{
and }}
) and use of :=
is required to handle tidyverse's non-standard evaluation (NSE).
Now use the function.
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, data, 3) %>%
lagSum(sumFYear5, data, 5)
# A tibble: 18 × 6
ID C fyear data sumFYear3 sumFYear5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA NA
2 1 a 2001 50 NA NA
3 1 a 2002 22 NA NA
4 1 a 2003 3 102 NA
5 1 a 2004 6 75 NA
6 1 a 2005 11 31 111
7 3 b 2000 5 NA 92
8 3 b 2001 3 NA 47
9 3 b 2002 7 NA 28
10 5 b 2003 6 NA 32
11 5 b 2004 9 NA 32
12 4 c 2000 31 NA 30
13 4 c 2001 5 NA 56
14 4 c 2002 6 NA 58
15 4 c 2003 7 42 57
16 4 c 2004 44 18 58
17 4 c 2005 33 57 93
18 4 c 2006 2 84 95
EDIT
I misunderstood what you meant by "lag" and didn't read your description properly. My apologies.
I think your 86
in row 6 of your data5
column should be 92
. if not, please explain why not.
Getting the answers you want should be a simple matter of adapting the function I wrote. For example:
lagSum <- function(data, newCol, valueCol, maxLag) {
data %>%
mutate(
{{newCol}} := {{valueCol}} rowSums(
list(
sapply(
1:maxLag,
function(x) lag({{valueCol}}, x)
)
)[[1]]
)
) %>%
mutate() %>%
ungroup()
}
Gives
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, value, 2)
# A tibble: 18 × 5
ID C fyear value sumFYear3
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA
2 1 a 2001 50 NA
3 1 a 2002 22 102
4 1 a 2003 3 75
5 1 a 2004 6 31
6 1 a 2005 11 20
7 3 b 2000 5 NA
8 3 b 2001 3 NA
9 3 b 2002 7 15
10 5 b 2003 6 NA
11 5 b 2004 9 NA
12 4 c 2000 31 NA
13 4 c 2001 5 NA
14 4 c 2002 6 42
15 4 c 2003 7 18
16 4 c 2004 44 57
17 4 c 2005 33 84
18 4 c 2006 2 79
and
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear5, value, 4)
# A tibble: 18 × 5
ID C fyear value sumFYear5
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA
2 1 a 2001 50 NA
3 1 a 2002 22 NA
4 1 a 2003 3 NA
5 1 a 2004 6 111
6 1 a 2005 11 92
7 3 b 2000 5 NA
8 3 b 2001 3 NA
9 3 b 2002 7 NA
10 5 b 2003 6 NA
11 5 b 2004 9 NA
12 4 c 2000 31 NA
13 4 c 2001 5 NA
14 4 c 2002 6 NA
15 4 c 2003 7 NA
16 4 c 2004 44 93
17 4 c 2005 33 95
18 4 c 2006 2 92
as expected, but
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, value, 2) %>%
lagSum(sumFYear5, value, 4)
# A tibble: 18 × 6
ID C fyear value sumFYear3 sumFYear5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA NA
2 1 a 2001 50 NA NA
3 1 a 2002 22 102 NA
4 1 a 2003 3 75 NA
5 1 a 2004 6 31 111
6 1 a 2005 11 20 92
7 3 b 2000 5 NA 47
8 3 b 2001 3 NA 28
9 3 b 2002 7 15 32
10 5 b 2003 6 NA 32
11 5 b 2004 9 NA 30
12 4 c 2000 31 NA 56
13 4 c 2001 5 NA 58
14 4 c 2002 6 42 57
15 4 c 2003 7 18 58
16 4 c 2004 44 57 93
17 4 c 2005 33 84 95
18 4 c 2006 2 79 92
Not as expected. At the moment, I cannot explain why. I managed to get the correct answers for both 3 and 5 year lags in the same pipe with:
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, value, 2) %>%
left_join(
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear5, value, 4)
)
But that shouldn't be necessary. I will think about this some more and may post a question of my own if I can't find an explanation.
Alternatively, this question gives a solution using the zoo
package.
CodePudding user response:
We can use frollsum
within data.table
library(data.table)
d <- 2:5
setDT(df1)[
,
c(paste0("data", d)) := lapply(d, frollsum, x = data),
.(ID, C)
]
which yields
> df1
ID C fyear data data2 data3 data4 data5
1: 1 a 2000 30 NA NA NA NA
2: 1 a 2001 50 80 NA NA NA
3: 1 a 2002 22 72 102 NA NA
4: 1 a 2003 3 25 75 105 NA
5: 1 a 2004 6 9 31 81 111
6: 1 a 2005 11 17 20 42 92
7: 3 b 2000 5 NA NA NA NA
8: 3 b 2001 3 8 NA NA NA
9: 3 b 2002 7 10 15 NA NA
10: 5 b 2003 6 NA NA NA NA
11: 5 b 2004 9 15 NA NA NA
12: 4 c 2000 31 NA NA NA NA
13: 4 c 2001 5 36 NA NA NA
14: 4 c 2002 6 11 42 NA NA
15: 4 c 2003 7 13 18 49 NA
16: 4 c 2004 44 51 57 62 93
17: 4 c 2005 33 77 84 90 95
18: 4 c 2006 2 35 79 86 92