Average Annual Growth Rate with moving length R-CodePudding

Given the structure of my data below, I would like to calculate the forward five years average annual growth rate (where the first year is lead(markup) of the column markup for each individual (individual are identified in the column (gvkey)), and add to the data frame that average as a column. However, some individuals have less than five years observations, and for all individuals', their last 4 years observations, have less than 5 years of observations ahead of them. For those cases, the average annual growth rate should adjust to the number of observation ahead of them (with a maximum of 5).

dput(example)
structure(list(gvkey = c(1001L, 1001L, 1001L, 1003L, 1003L, 1003L, 
1003L, 1003L, 1003L, 1003L, 1004L, 1004L, 1004L, 1004L, 1004L, 
1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 
1004L, 1004L, 1004L, 1004L, 1004L, 1004L), fyear = c(1983L, 1984L, 
1985L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 1980L, 
1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 
1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 
1999L), markup = c(3.02456418383518, 2.91714600416106, 2.97620103473762, 
0.628645648836935, 0.538264738598443, 0.74536402337831, 0.89905329776662, 
0.571759161863088, 0.510497237569061, 0.621391904401246, 0.320146680750145, 
0.277978758953348, 0.31442332968701, 0.319433516915814, 0.324865816687745, 
0.335264348013352, 0.328048313395744, 0.326632245360565, 0.340874293859881, 
0.320374201245953, 0.27456562124358, 0.276693369097675, 0.245072145096866, 
0.241026046834387, 0.242841330851661, 0.249635000371186, 0.257903948772679, 
0.262641379065405, 0.261534064206543, 0.22953354130982)), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -30L), groups = structure(list(
    gvkey = c(1001L, 1003L, 1004L), .rows = structure(list(1:3, 
        4:10, 11:30), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), .drop = TRUE))

This is as far as I got:

example %>%
  filter(fyear %in% 1980:2019)%>%
  group_by((gvkey))%>%
  mutate(markupchange = ((((lead(markup)-markup)/markup) (lead(markup, K =2)-lead(markup)/lead(markup)) (lead(markup, K =3)-lead(markup, k =2)/lead(markup, K=2)) (lead(markup, K =4)-lead(markup, k =3)/lead(markup, K=3)) (lead(markup, K =5)-lead(markup, k =4)/lead(markup, K=4))/5)))

What I can't figure out how to indicate is to shorten the lenght of the average annual growth rate for thoses cases with less than 5 observations ahead.

As an ouput I would like to get back the same data frame with the extra column for the average annual growth rate of the markup. The value in row 1 of the added column should be -0,00628231878798876 and in the second row 0,020547945. Many thanks for any tips.

CodePudding user response：

Since your system is complaining about the size of your data, I'm going to recommend a non-join method (see the answer history for the first cut on this).

It is a bespoke function for each group, so it internally determines which data and the mean;
If necessary (based on memory complaints), I will suggest a data.table variant. I'll also provide the dplyr version that works with this data, but it may not work with your larger data, in which case data.table (with its in-place referential semantics) will be preferred.

First, a quick helper-function:

fun <- function(mkup, fyr, span=5) {
  sapply(fyr, function(yr) {
    val <- mkup[between(fyr, yr, yr span)]
    mean(c(diff(val), NA) / val, na.rm = TRUE)
  })
}

noting that between can be from either dplyr or data.table, they are equivalent in this use.

### data.table
library(data.table)
EX <- as.data.table(example)
EX[, avg5 := fun(markup, fyear), by = gvkey]
### (same result as below)

### dplyr
example %>%
  group_by(gvkey) %>%
  mutate(avg5 = fun(markup, fyear)) %>%
  ungroup()
# # A tibble: 30 × 4
#    gvkey fyear markup      avg5
#    <int> <int>  <dbl>     <dbl>
#  1  1001  1983  3.02   -0.00764
#  2  1001  1984  2.92    0.0202 
#  3  1001  1985  2.98  NaN      
#  4  1003  1983  0.629  -0.00480
#  5  1003  1984  0.538   0.0674 
#  6  1003  1985  0.745  -0.0119 
#  7  1003  1986  0.899  -0.0847 
#  8  1003  1987  0.572   0.0550 
#  9  1003  1988  0.510   0.217  
# 10  1003  1989  0.621 NaN      
# # … with 20 more rows
# # ℹ Use `print(n = ...)` to see more rows