Given the structure of my data below, I would like to calculate the forward five years average annual growth rate (where the first year is lead(markup)
of the column markup
for each individual (individual are identified in the column (gvkey)
), and add to the data frame that average as a column. However, some individuals have less than five years observations, and for all individuals', their last 4 years observations, have less than 5 years of observations ahead of them. For those cases, the average annual growth rate should adjust to the number of observation ahead of them (with a maximum of 5).
dput(example)
structure(list(gvkey = c(1001L, 1001L, 1001L, 1003L, 1003L, 1003L,
1003L, 1003L, 1003L, 1003L, 1004L, 1004L, 1004L, 1004L, 1004L,
1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L,
1004L, 1004L, 1004L, 1004L, 1004L, 1004L), fyear = c(1983L, 1984L,
1985L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 1980L,
1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L,
1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L,
1999L), markup = c(3.02456418383518, 2.91714600416106, 2.97620103473762,
0.628645648836935, 0.538264738598443, 0.74536402337831, 0.89905329776662,
0.571759161863088, 0.510497237569061, 0.621391904401246, 0.320146680750145,
0.277978758953348, 0.31442332968701, 0.319433516915814, 0.324865816687745,
0.335264348013352, 0.328048313395744, 0.326632245360565, 0.340874293859881,
0.320374201245953, 0.27456562124358, 0.276693369097675, 0.245072145096866,
0.241026046834387, 0.242841330851661, 0.249635000371186, 0.257903948772679,
0.262641379065405, 0.261534064206543, 0.22953354130982)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -30L), groups = structure(list(
gvkey = c(1001L, 1003L, 1004L), .rows = structure(list(1:3,
4:10, 11:30), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), .drop = TRUE))
This is as far as I got:
example %>%
filter(fyear %in% 1980:2019)%>%
group_by((gvkey))%>%
mutate(markupchange = ((((lead(markup)-markup)/markup) (lead(markup, K =2)-lead(markup)/lead(markup)) (lead(markup, K =3)-lead(markup, k =2)/lead(markup, K=2)) (lead(markup, K =4)-lead(markup, k =3)/lead(markup, K=3)) (lead(markup, K =5)-lead(markup, k =4)/lead(markup, K=4))/5)))
What I can't figure out how to indicate is to shorten the lenght of the average annual growth rate for thoses cases with less than 5 observations ahead.
As an ouput I would like to get back the same data frame with the extra column for the average annual growth rate of the markup
.
The value in row 1 of the added column should be -0,00628231878798876 and in the second row 0,020547945.
Many thanks for any tips.
CodePudding user response:
Since your system is complaining about the size of your data, I'm going to recommend a non-join method (see the answer history for the first cut on this).
- It is a bespoke function for each group, so it internally determines which data and the mean;
- If necessary (based on memory complaints), I will suggest a
data.table
variant. I'll also provide thedplyr
version that works with this data, but it may not work with your larger data, in which casedata.table
(with its in-place referential semantics) will be preferred.
First, a quick helper-function:
fun <- function(mkup, fyr, span=5) {
sapply(fyr, function(yr) {
val <- mkup[between(fyr, yr, yr span)]
mean(c(diff(val), NA) / val, na.rm = TRUE)
})
}
noting that between
can be from either dplyr
or data.table
, they are equivalent in this use.
### data.table
library(data.table)
EX <- as.data.table(example)
EX[, avg5 := fun(markup, fyear), by = gvkey]
### (same result as below)
### dplyr
example %>%
group_by(gvkey) %>%
mutate(avg5 = fun(markup, fyear)) %>%
ungroup()
# # A tibble: 30 × 4
# gvkey fyear markup avg5
# <int> <int> <dbl> <dbl>
# 1 1001 1983 3.02 -0.00764
# 2 1001 1984 2.92 0.0202
# 3 1001 1985 2.98 NaN
# 4 1003 1983 0.629 -0.00480
# 5 1003 1984 0.538 0.0674
# 6 1003 1985 0.745 -0.0119
# 7 1003 1986 0.899 -0.0847
# 8 1003 1987 0.572 0.0550
# 9 1003 1988 0.510 0.217
# 10 1003 1989 0.621 NaN
# # … with 20 more rows
# # ℹ Use `print(n = ...)` to see more rows