I am looking for a way to include all variables of a data frame AND their within-group first lags in a regression model. My data has a shape similar to this:
df <- data.frame(group = c('A','A','A','A','B','B','B','B'),
y= 11:18,
x1= 21:28,
x2= 1:8)
df
group y x1 x2
1 A 11 21 1
2 A 12 22 2
3 A 13 23 3
4 A 14 24 4
5 B 15 25 5
6 B 16 26 6
7 B 17 27 7
8 B 18 28 8
I want to include x1 and x2 as regressors as well as their respective first lags. It is important that the lags of x1 and x2 stay within their groups A and B (i.e. I am not interested in the difference between rows 5 and 4).
However my real dataset includes more than 150 variables so the approach to manually add the lagged values as a new variable to the data frame (e.g. df <- df %>% group_by(group) %>% mutate(value1_lag = lag(value1))
is not feasable.
Instead I am looking for something along the lines of
model <- dyn$lmf(y ~ . lag(.), data = df)
however it does not seem to work with the simple lag() operator.
Thanks in advance for your help!
CodePudding user response:
You could add the variables all at once using across
:
df <- data.frame(group = c('A','A','A','A','B','B','B','B'),
y= 11:18,
x1= 21:28,
x2= 1:8)
library(dplyr)
library(tidyr)
df %>%
group_by(group) %>%
mutate(time = row_number(),
lag = across(-c(y, time), lag)) %>%
unnest(lag, names_sep = "_") %>%
arrange(group, time) %>%
group_by(group) %>%
mutate(across(starts_with("lag_"), ~case_when(time == 1 ~ lead(.x),
TRUE ~ .x)))
#> # A tibble: 8 × 7
#> # Groups: group [2]
#> group y x1 x2 time lag_x1 lag_x2
#> <chr> <int> <int> <int> <int> <int> <int>
#> 1 A 11 21 1 1 21 1
#> 2 A 12 22 2 2 21 1
#> 3 A 13 23 3 3 22 2
#> 4 A 14 24 4 4 23 3
#> 5 B 15 25 5 1 25 5
#> 6 B 16 26 6 2 25 5
#> 7 B 17 27 7 3 26 6
#> 8 B 18 28 8 4 27 7
Created on 2022-05-09 by the reprex package (v2.0.1)
CodePudding user response:
You could grep
the names of your regressors into vector xn
, split the data by
group, apply the lag function with a careful setNames
, rbind
.
xn <- grep('^X', names(dat), value=TRUE)
dat <- by(dat, dat$group, \(X) {
cbind(X, setNames(as.data.frame(lapply(X[xn], \(x) c(NA, x[-length(x)]))), paste0(xn, '.lag1')))
}) |> do.call(what='rbind')
Then just reformulate
while grep
ing again with the collapse=
d xn
.
lm(reformulate(grep(paste(xn, collapse='|'), names(dat), value=TRUE), 'Y'), dat) |>
summary()
# Call:
# lm(formula = reformulate(grep(paste(xn, collapse = "|"), names(dat),
# value = TRUE), "Y"), data = dat)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.85222 -0.31818 0.05702 0.33261 1.01141
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.6775 0.1768 3.832 0.000231 ***
# X1 1.5754 0.1688 9.335 5.22e-15 ***
# X2 2.5787 0.1750 14.737 < 2e-16 ***
# X1.lag1 0.7651 0.1652 4.630 1.18e-05 ***
# X2.lag1 1.9722 0.1777 11.098 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4892 on 93 degrees of freedom
# (2 observations deleted due to missingness)
# Multiple R-squared: 0.8234, Adjusted R-squared: 0.8158
# F-statistic: 108.4 on 4 and 93 DF, p-value: < 2.2e-16
Data:
dat <- structure(list(X1 = c(0.915, 0.937, 0.286, 0.83, 0.642, 0.519,
0.737, 0.135, 0.657, 0.705, 0.458, 0.719, 0.935, 0.255, 0.462,
0.94, 0.978, 0.117, 0.475, 0.56, 0.904, 0.139, 0.989, 0.947,
0.082, 0.514, 0.39, 0.906, 0.447, 0.836, 0.738, 0.811, 0.388,
0.685, 0.004, 0.833, 0.007, 0.208, 0.907, 0.612, 0.38, 0.436,
0.037, 0.974, 0.432, 0.958, 0.888, 0.64, 0.971, 0.619, 0.852,
0.443, 0.158, 0.442, 0.968, 0.485, 0.252, 0.26, 0.542, 0.65,
0.336, 0.061, 0.451, 0.839, 0.575, 0.353, 0.547, 0.893, 0.49,
0.172, 0.543, 0.961, 0.314, 0.821, 0.307, 0.185, 0.048, 0.246,
0.351, 0.159, 0.304, 0.018, 0.997, 0.804, 0.087, 0.87, 0.555,
0.421, 0.068, 0.561, 0.071, 0.211, 0.55, 0.482, 0.159, 0.15,
0.499, 0.941, 0.334, 0.188), X2 = c(0.333, 0.347, 0.398, 0.785,
0.039, 0.749, 0.677, 0.171, 0.261, 0.514, 0.676, 0.983, 0.76,
0.566, 0.85, 0.189, 0.271, 0.828, 0.693, 0.241, 0.043, 0.14,
0.216, 0.479, 0.197, 0.719, 0.008, 0.375, 0.514, 0.002, 0.582,
0.158, 0.359, 0.646, 0.776, 0.564, 0.234, 0.09, 0.086, 0.305,
0.667, 0, 0.209, 0.933, 0.926, 0.734, 0.333, 0.515, 0.744, 0.619,
0.27, 0.531, 0.021, 0.799, 0.11, 0.54, 0.571, 0.619, 0.715, 0.123,
0.311, 0.946, 0.5, 0.135, 0.869, 0.205, 0.925, 0.887, 0.136,
0.785, 0.453, 0.136, 0.885, 0.337, 0.319, 0.404, 0.479, 0.368,
0.466, 0.05, 0.187, 0.983, 0.328, 0.171, 0.488, 0.019, 0.339,
0.03, 0.867, 0.732, 0.315, 0.386, 0.332, 0.09, 0.757, 0.603,
0.145, 0.033, 0.484, 0.445), Y = c(3.499, 5.284, 3.72, 5.143,
3.938, 4.344, 5.567, 1.378, 2.717, 3.949, 4.769, 6.487, 7.533,
4.3, 5.513, 4.778, 4.609, 4.701, 5.164, 3.182, 2.933, 2.001,
2.57, 4.482, 2.953, 4.048, 2.989, 2.901, 3.544, 3.909, 4.188,
4.041, 2.92, 3.816, 4.618, 4.804, 2.585, 1.627, 2.869, 3.913,
3.746, 2.912, 1.832, 4.645, 6.05, 5.835, 4.471, 4.419, 6.171,
6.155, 3.79, 3.859, 2.571, 2.721, 4.324, 3.452, 3.792, 4.636,
4.482, 3.432, 3.037, 3.83, 4.799, 4.377, 4.026, 4.404, 4.303,
6.612, 4.414, 4.023, 3.419, 4.627, 4.849, 4.603, 2.815, 3.36,
3.341, 3.085, 3.44, 2.349, 2.222, 3.909, 4.596, 3.231, 3.66,
3.347, 2.823, 2.017, 3.485, 4.754, 3.952, 2.139, 2.819, 3.519,
3.637, 4.422, 4.153, 2.96, 2.407, 3.646), group = c("A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B", "B", "B", "B")), row.names = c(NA, -100L), class = "data.frame")
CodePudding user response:
There are several problems in the question:
- dyn is not the right package for longitudinal data.
- The df shown in the question is not a good example because the columns are linearly dependent due to the use of linear sequences so try this example where we will use mpg in mtcars as the dependent variable, cyl as the grouping variable and hp and disp as the independent variables.
Note that there are multiple lag versions in different packages and also in base R so we explicitly refer to which one we want each time.
1) plm The plm package has facilities specifically for panel data. First add a time column and then use plm. The plm model= argument can provide for many other possibilities. plm assumes that the first column is the group and the second column is time.
library(plm)
mtcars2 <- transform(mtcars, time = ave(1:nrow(mtcars), cyl, FUN = seq_along))
indep <- names(mtcars2)[3:4]
L01 <- function(x) plm::lag(x, 0:1)
fo <- reformulate(sprintf("L01(%s)", indep), "mpg")
fo
## mpg ~ L01(disp) L01(hp)
plm(fo, data = mtcars2, index = c("cyl", "time"), model = "pooling")
giving:
Model Formula: mpg ~ L01(disp) L01(hp)
Coefficients:
(Intercept) L01(disp)0 L01(disp)1 L01(hp)0 L01(hp)1
31.0212451 -0.0327843 -0.0019717 -0.0243698 0.0051201
2) lm Compute the lagged variables using dplyr, the formula using reformulate and then invoke lm.
library(dplyr)
indep <- names(mtcars2)[3:4]
mtcars2 <- mtcars %>%
group_by(cyl = factor(cyl)) %>%
mutate(across(all_of(indep), dplyr::lag, .names = "lag.{col}")) %>%
ungroup
fo <- reformulate(c(indep, paste0("lag.", indep)), "mpg")
fo
## mpg ~ disp hp lag.disp lag.hp
lm(fo, mtcars2, na.action = na.exclude)
giving the same coefficients:
Call:
lm(formula = fo, data = mtcars2, na.action = na.exclude)
Coefficients:
(Intercept) disp hp lag.disp lag.hp
31.021245 -0.032784 -0.024370 -0.001972 0.005120
Update
Reorganized answer.