Include the first lag of all variables in regression-CodePudding

I am looking for a way to include all variables of a data frame AND their within-group first lags in a regression model. My data has a shape similar to this:

df <- data.frame(group = c('A','A','A','A','B','B','B','B'),
                 y= 11:18,
                 x1= 21:28,
                 x2= 1:8)
df
    group  y      x1      x2
1     A     11     21      1
2     A     12     22      2
3     A     13     23      3
4     A     14     24      4
5     B     15     25      5
6     B     16     26      6
7     B     17     27      7
8     B     18     28      8

I want to include x1 and x2 as regressors as well as their respective first lags. It is important that the lags of x1 and x2 stay within their groups A and B (i.e. I am not interested in the difference between rows 5 and 4).

However my real dataset includes more than 150 variables so the approach to manually add the lagged values as a new variable to the data frame (e.g. df <- df %>% group_by(group) %>% mutate(value1_lag = lag(value1)) is not feasable.

Instead I am looking for something along the lines of

model <- dyn$lmf(y ~ . lag(.), data = df)

however it does not seem to work with the simple lag() operator.

Thanks in advance for your help!

CodePudding user response：

You could add the variables all at once using across:

df <- data.frame(group = c('A','A','A','A','B','B','B','B'),
                 y= 11:18,
                 x1= 21:28,
                 x2= 1:8)

library(dplyr)
library(tidyr)
df %>% 
  group_by(group) %>% 
  mutate(time = row_number(), 
         lag = across(-c(y, time), lag)) %>% 
  unnest(lag, names_sep = "_") %>%
  arrange(group, time) %>% 
  group_by(group) %>% 
  mutate(across(starts_with("lag_"), ~case_when(time == 1 ~ lead(.x), 
                                                TRUE ~ .x)))
#> # A tibble: 8 × 7
#> # Groups:   group [2]
#>   group     y    x1    x2  time lag_x1 lag_x2
#>   <chr> <int> <int> <int> <int>  <int>  <int>
#> 1 A        11    21     1     1     21      1
#> 2 A        12    22     2     2     21      1
#> 3 A        13    23     3     3     22      2
#> 4 A        14    24     4     4     23      3
#> 5 B        15    25     5     1     25      5
#> 6 B        16    26     6     2     25      5
#> 7 B        17    27     7     3     26      6
#> 8 B        18    28     8     4     27      7

^{Created on 2022-05-09 by the reprex package (v2.0.1)}

CodePudding user response：

You could grep the names of your regressors into vector xn, split the data by group, apply the lag function with a careful setNames, rbind.

xn <- grep('^X', names(dat), value=TRUE)

dat <- by(dat, dat$group, \(X) {
  cbind(X, setNames(as.data.frame(lapply(X[xn], \(x) c(NA, x[-length(x)]))), paste0(xn, '.lag1')))
}) |> do.call(what='rbind')

Then just reformulate while greping again with the collapse=d xn.

lm(reformulate(grep(paste(xn, collapse='|'), names(dat), value=TRUE), 'Y'), dat) |>
  summary() 
# Call:
# lm(formula = reformulate(grep(paste(xn, collapse = "|"), names(dat), 
#                                 value = TRUE), "Y"), data = dat)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -1.85222 -0.31818  0.05702  0.33261  1.01141 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   0.6775     0.1768   3.832 0.000231 ***
# X1            1.5754     0.1688   9.335 5.22e-15 ***
# X2            2.5787     0.1750  14.737  < 2e-16 ***
# X1.lag1       0.7651     0.1652   4.630 1.18e-05 ***
# X2.lag1       1.9722     0.1777  11.098  < 2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.4892 on 93 degrees of freedom
# (2 observations deleted due to missingness)
# Multiple R-squared:  0.8234,  Adjusted R-squared:  0.8158 
# F-statistic: 108.4 on 4 and 93 DF,  p-value: < 2.2e-16

Data:

dat <- structure(list(X1 = c(0.915, 0.937, 0.286, 0.83, 0.642, 0.519, 
0.737, 0.135, 0.657, 0.705, 0.458, 0.719, 0.935, 0.255, 0.462, 
0.94, 0.978, 0.117, 0.475, 0.56, 0.904, 0.139, 0.989, 0.947, 
0.082, 0.514, 0.39, 0.906, 0.447, 0.836, 0.738, 0.811, 0.388, 
0.685, 0.004, 0.833, 0.007, 0.208, 0.907, 0.612, 0.38, 0.436, 
0.037, 0.974, 0.432, 0.958, 0.888, 0.64, 0.971, 0.619, 0.852, 
0.443, 0.158, 0.442, 0.968, 0.485, 0.252, 0.26, 0.542, 0.65, 
0.336, 0.061, 0.451, 0.839, 0.575, 0.353, 0.547, 0.893, 0.49, 
0.172, 0.543, 0.961, 0.314, 0.821, 0.307, 0.185, 0.048, 0.246, 
0.351, 0.159, 0.304, 0.018, 0.997, 0.804, 0.087, 0.87, 0.555, 
0.421, 0.068, 0.561, 0.071, 0.211, 0.55, 0.482, 0.159, 0.15, 
0.499, 0.941, 0.334, 0.188), X2 = c(0.333, 0.347, 0.398, 0.785, 
0.039, 0.749, 0.677, 0.171, 0.261, 0.514, 0.676, 0.983, 0.76, 
0.566, 0.85, 0.189, 0.271, 0.828, 0.693, 0.241, 0.043, 0.14, 
0.216, 0.479, 0.197, 0.719, 0.008, 0.375, 0.514, 0.002, 0.582, 
0.158, 0.359, 0.646, 0.776, 0.564, 0.234, 0.09, 0.086, 0.305, 
0.667, 0, 0.209, 0.933, 0.926, 0.734, 0.333, 0.515, 0.744, 0.619, 
0.27, 0.531, 0.021, 0.799, 0.11, 0.54, 0.571, 0.619, 0.715, 0.123, 
0.311, 0.946, 0.5, 0.135, 0.869, 0.205, 0.925, 0.887, 0.136, 
0.785, 0.453, 0.136, 0.885, 0.337, 0.319, 0.404, 0.479, 0.368, 
0.466, 0.05, 0.187, 0.983, 0.328, 0.171, 0.488, 0.019, 0.339, 
0.03, 0.867, 0.732, 0.315, 0.386, 0.332, 0.09, 0.757, 0.603, 
0.145, 0.033, 0.484, 0.445), Y = c(3.499, 5.284, 3.72, 5.143, 
3.938, 4.344, 5.567, 1.378, 2.717, 3.949, 4.769, 6.487, 7.533, 
4.3, 5.513, 4.778, 4.609, 4.701, 5.164, 3.182, 2.933, 2.001, 
2.57, 4.482, 2.953, 4.048, 2.989, 2.901, 3.544, 3.909, 4.188, 
4.041, 2.92, 3.816, 4.618, 4.804, 2.585, 1.627, 2.869, 3.913, 
3.746, 2.912, 1.832, 4.645, 6.05, 5.835, 4.471, 4.419, 6.171, 
6.155, 3.79, 3.859, 2.571, 2.721, 4.324, 3.452, 3.792, 4.636, 
4.482, 3.432, 3.037, 3.83, 4.799, 4.377, 4.026, 4.404, 4.303, 
6.612, 4.414, 4.023, 3.419, 4.627, 4.849, 4.603, 2.815, 3.36, 
3.341, 3.085, 3.44, 2.349, 2.222, 3.909, 4.596, 3.231, 3.66, 
3.347, 2.823, 2.017, 3.485, 4.754, 3.952, 2.139, 2.819, 3.519, 
3.637, 4.422, 4.153, 2.96, 2.407, 3.646), group = c("A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B")), row.names = c(NA, -100L), class = "data.frame")

CodePudding user response：

There are several problems in the question:

dyn is not the right package for longitudinal data.
The df shown in the question is not a good example because the columns are linearly dependent due to the use of linear sequences so try this example where we will use mpg in mtcars as the dependent variable, cyl as the grouping variable and hp and disp as the independent variables.

Note that there are multiple lag versions in different packages and also in base R so we explicitly refer to which one we want each time.

1) plm The plm package has facilities specifically for panel data. First add a time column and then use plm. The plm model= argument can provide for many other possibilities. plm assumes that the first column is the group and the second column is time.

library(plm)

mtcars2 <- transform(mtcars, time = ave(1:nrow(mtcars), cyl, FUN = seq_along))

indep <- names(mtcars2)[3:4]

L01 <- function(x) plm::lag(x, 0:1)
fo <- reformulate(sprintf("L01(%s)", indep), "mpg")
fo
## mpg ~ L01(disp)   L01(hp)

plm(fo, data = mtcars2, index = c("cyl", "time"), model = "pooling")

giving:

Model Formula: mpg ~ L01(disp)   L01(hp)

Coefficients:
(Intercept)  L01(disp)0  L01(disp)1    L01(hp)0    L01(hp)1 
 31.0212451  -0.0327843  -0.0019717  -0.0243698   0.0051201

2) lm Compute the lagged variables using dplyr, the formula using reformulate and then invoke lm.

library(dplyr)

indep <- names(mtcars2)[3:4]

mtcars2 <- mtcars %>%
  group_by(cyl = factor(cyl)) %>%
  mutate(across(all_of(indep), dplyr::lag, .names = "lag.{col}")) %>%
  ungroup

fo <- reformulate(c(indep, paste0("lag.", indep)), "mpg")
fo
## mpg ~ disp   hp   lag.disp   lag.hp

lm(fo, mtcars2, na.action = na.exclude)

giving the same coefficients:

Call:
lm(formula = fo, data = mtcars2, na.action = na.exclude)

Coefficients:
(Intercept)         disp           hp     lag.disp       lag.hp  
  31.021245    -0.032784    -0.024370    -0.001972     0.005120

Update

Reorganized answer.