I have two vectors x
and y
x = c(1, -0.0023, -0.001, 0.0077, 0.0011, -0.0039, 3e-04, 0.0068,
0.0134, 0.0143, 3e-04, 0.0026, 0.0033, 0.0056, -0.01, -0.0012,
0.0052, 0.0061, -3e-04, -6e-04, 0.005, 0.006, 0.0249, 0.0066,
0.0147, 0.0019, -0.0092, -0.001, 0.0027, -0.0196, 0.0211, 0.0016,
-0.0112, -0.0065, 0.0035, 0.0157, -0.0085, 0.0062, 0, -0.0085,
0.0163, 0.014, -0.0148, -1e-04, -0.0057, 0.0051, -0.0071, 0.0016,
0.0066, 0.0033)
y = c(1, -0.0066, -0.015, 0.0017, -8e-04, -0.0177, -0.0026, 0.0068,
0.0521, 0.0525, 0.0091, 0.0015, 0.0209, 0.0022, -0.0067, 0.0074,
0.0132, 0.0094, 0.0029, 0.0178, 0.0154, 0.0076, 0.0551, 0.0194,
0.0402, -0.0143, -0.0094, -0.0101, 0.0044, -0.0507, 0.0392, 0.0184,
-0.0051, 0.0038, 0.0119, 0.0289, -0.0079, 0.0073, -0.0104, -0.0142,
0.0366, 0.0301, -6e-04, 0.0075, -0.0029, 0.0035, -0.018, 0.0012,
0.0088, -0.0117)
I am trying to run lmtest::waldtest
on these vectors to find causality.
Method 1 works perfectly -
Method 1
fit_1 = lm(x ~ y)
lmtest::waldtest(fit_1, test = "F")
The intent is to make Method 1
run faster since I am running wald test
on millions of samples. To improve performance over Method 1
, I have written two other methods (2 and 3) and both of them are throwing errors as shown below.
Method 2 - Since my data is in data.table format, I will prefer to execute lm
directly on data.table columns for improving performance.
library(data.table)
dt = data.table(x, y)
fit_2 = lm(x ~ y, data = dt)
lmtest::waldtest(fit_2, test = "F")
Error in as.data.frame.default(data, optional = TRUE) :
cannot coerce class ‘"function"’ to a data.frame
Method 3 - Since fastLM is much faster than lm
function
fit_3 = RcppArmadillo::fastLm(x ~ y)
lmtest::waldtest(fit_3, test = "F")
Error in fastLm.formula(formula = x ~ 1) :
could not find function "fastLm.formula"
Q1. Can someone point out why Method 2 and Method 3 fail and are there ways to fix these?
Q2. And is there any other way to make Method 1 faster?
CodePudding user response:
Okay, so, let's start with Method 3. Very simply, we can see from the documentation for ?lmtest::wildest
there is no method defined for a fastLm
object and the default is failing. @Martin has a good explanation for how to use fast.lm()
For Method 2 however, you are actually running into namespace conflicts in the function calls. The issue is that dt
is a reserved name in stats::dt
.
> lmtest::waldtest(fit_2, test = "F")
#> Error in model.frame.default(formula = x ~ 1, data = dt, drop.unused.levels = TRUE) :
'data' must be a data.frame, environment, or list
> traceback()
#> 14: stop("'data' must be a data.frame, environment, or list")
#> 13: model.frame.default(formula = x ~ 1, data = dt, drop.unused.levels = TRUE)
#> 12: stats::model.frame(formula = x ~ 1, data = dt, drop.unused.levels = TRUE)
#> 11: eval(mf, parent.frame())
#> 10: eval(mf, parent.frame())
#> 9: lm(formula = x ~ 1, data = dt)
#> 8: eval(call, parent.frame())
#> 7: eval(call, parent.frame())
#> 6: update.default(fm, update)
#> 5: update0(fm, update)
#> 4: modelUpdate(objects[[i - 1]], objects[[i]])
#> 3: waldtest.default(object, . ~ 1, test = match.arg(test))
#> 2: waldtest.lm(fit_2, test = "F")
#> 1: lmtest::waldtest(fit_2, test = "F")
In update.default
, 6th in the traceback()
, the final line run is:
eval(call, parent.frame())
In this case, call == lm(formula = x ~ 1, data = dt)
. So, when this is run in the parent.frame
, rather than accessing the data.table
dt
, it's pulling in the stats::dt
function. There's a lot of documentation elsewhere from people smarter than myself on namespaces, parent frames, and environments, so won't go into detail here. However, your solution should just be to rename dt
.
dt2 = data.table(x, y)
fit_2 = lm(x ~ y, data = dt2)
lmtest::waldtest(fit_2, test = "F")
#> Wald test
#>
#> Model 1: x ~ y
#> Model 2: x ~ 1
#> Res.Df Df F Pr(>F)
#> 1 48
#> 2 49 -1 5253.1 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
CodePudding user response:
These procedures fail because waldtest()
is looking for attributes/methods that don't exist in/for fitted model objects returned by fast.lm()
and lm()
with data.table input and I don't believe there is an easy fix.
However, if you want to test whether the model including y
and a constant is "better" than the constant-only model you can use summary()
(which has a method for objects of class fast.lm
).
E.g. from the last line of summary(fit_3)
you'll get
F-statistic: 5253 on 1 and 48 DF, p-value: < 2.2e-16
which is equivalent to the result of waldtest()
:
lmtest::waldtest(fit_1, test = "F")
Wald test
Model 1: x ~ y
Model 2: x ~ 1
Res.Df Df F Pr(>F)
1 48
2 49 -1 5253.1 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1