Fast Granger Test/ Wald Test on large number of samples-CodePudding

I have two vectors x and y

x = c(1, -0.0023, -0.001, 0.0077, 0.0011, -0.0039, 3e-04, 0.0068, 
0.0134, 0.0143, 3e-04, 0.0026, 0.0033, 0.0056, -0.01, -0.0012, 
0.0052, 0.0061, -3e-04, -6e-04, 0.005, 0.006, 0.0249, 0.0066, 
0.0147, 0.0019, -0.0092, -0.001, 0.0027, -0.0196, 0.0211, 0.0016, 
-0.0112, -0.0065, 0.0035, 0.0157, -0.0085, 0.0062, 0, -0.0085, 
0.0163, 0.014, -0.0148, -1e-04, -0.0057, 0.0051, -0.0071, 0.0016, 
0.0066, 0.0033)
y = c(1, -0.0066, -0.015, 0.0017, -8e-04, -0.0177, -0.0026, 0.0068, 
0.0521, 0.0525, 0.0091, 0.0015, 0.0209, 0.0022, -0.0067, 0.0074, 
0.0132, 0.0094, 0.0029, 0.0178, 0.0154, 0.0076, 0.0551, 0.0194, 
0.0402, -0.0143, -0.0094, -0.0101, 0.0044, -0.0507, 0.0392, 0.0184, 
-0.0051, 0.0038, 0.0119, 0.0289, -0.0079, 0.0073, -0.0104, -0.0142, 
0.0366, 0.0301, -6e-04, 0.0075, -0.0029, 0.0035, -0.018, 0.0012, 
0.0088, -0.0117)

I am trying to run lmtest::waldtest on these vectors to find causality.

Method 1 works perfectly -

Method 1

  fit_1 = lm(x ~ y)
  lmtest::waldtest(fit_1, test = "F")

The intent is to make Method 1 run faster since I am running wald test on millions of samples. To improve performance over Method 1, I have written two other methods (2 and 3) and both of them are throwing errors as shown below.

Method 2 - Since my data is in data.table format, I will prefer to execute lm directly on data.table columns for improving performance.

  library(data.table)
  dt = data.table(x, y)
  fit_2 = lm(x ~ y, data = dt)
  lmtest::waldtest(fit_2, test = "F")

Error in as.data.frame.default(data, optional = TRUE) : 
  cannot coerce class ‘"function"’ to a data.frame

Method 3 - Since fastLM is much faster than lm function

  fit_3 = RcppArmadillo::fastLm(x ~ y)
  lmtest::waldtest(fit_3, test = "F")

Error in fastLm.formula(formula = x ~ 1) : 
  could not find function "fastLm.formula"

Q1. Can someone point out why Method 2 and Method 3 fail and are there ways to fix these?

Q2. And is there any other way to make Method 1 faster?

CodePudding user response：

Okay, so, let's start with Method 3. Very simply, we can see from the documentation for ?lmtest::wildest there is no method defined for a fastLm object and the default is failing. @Martin has a good explanation for how to use fast.lm()

For Method 2 however, you are actually running into namespace conflicts in the function calls. The issue is that dt is a reserved name in stats::dt.

> lmtest::waldtest(fit_2, test = "F")
#> Error in model.frame.default(formula = x ~ 1, data = dt, drop.unused.levels = TRUE) : 
  'data' must be a data.frame, environment, or list
> traceback()
#> 14: stop("'data' must be a data.frame, environment, or list")
#> 13: model.frame.default(formula = x ~ 1, data = dt, drop.unused.levels = TRUE)
#> 12: stats::model.frame(formula = x ~ 1, data = dt, drop.unused.levels = TRUE)
#> 11: eval(mf, parent.frame())
#> 10: eval(mf, parent.frame())
#> 9: lm(formula = x ~ 1, data = dt)
#> 8: eval(call, parent.frame())
#> 7: eval(call, parent.frame())
#> 6: update.default(fm, update)
#> 5: update0(fm, update)
#> 4: modelUpdate(objects[[i - 1]], objects[[i]])
#> 3: waldtest.default(object, . ~ 1, test = match.arg(test))
#> 2: waldtest.lm(fit_2, test = "F")
#> 1: lmtest::waldtest(fit_2, test = "F")

In update.default, 6th in the traceback(), the final line run is:

eval(call, parent.frame())

In this case, call == lm(formula = x ~ 1, data = dt). So, when this is run in the parent.frame, rather than accessing the data.table dt, it's pulling in the stats::dt function. There's a lot of documentation elsewhere from people smarter than myself on namespaces, parent frames, and environments, so won't go into detail here. However, your solution should just be to rename dt.

dt2 = data.table(x, y)
fit_2 = lm(x ~ y, data = dt2)
lmtest::waldtest(fit_2, test = "F")
#> Wald test
#> 
#> Model 1: x ~ y
#> Model 2: x ~ 1
#>   Res.Df Df      F    Pr(>F)    
#> 1     48                        
#> 2     49 -1 5253.1 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

CodePudding user response：

These procedures fail because waldtest() is looking for attributes/methods that don't exist in/for fitted model objects returned by fast.lm() and lm() with data.table input and I don't believe there is an easy fix.

However, if you want to test whether the model including y and a constant is "better" than the constant-only model you can use summary() (which has a method for objects of class fast.lm).

E.g. from the last line of summary(fit_3) you'll get

F-statistic:  5253 on 1 and 48 DF,  p-value: < 2.2e-16

which is equivalent to the result of waldtest():

lmtest::waldtest(fit_1, test = "F")

Wald test

Model 1: x ~ y
Model 2: x ~ 1
  Res.Df Df      F    Pr(>F)    
1     48                        
2     49 -1 5253.1 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1