How Do I Use bestglm For Subset Selection In Logistic Regression?


I have a dataframe as follows:

a <- rnorm(10,5,1)
b <- rnorm(10,5,1)
c <- rnorm(10,5,1)
d <- rnorm(10,5,1)
e <- rnorm(10,5,1)
f <- rnorm(10,5,1)
g <- rnorm(10,5,1)
h <- sample(c(0,1), replace=TRUE, size=10) # Random 1s and 0s

df = data.frame(a,b,c,d,e,f,g,h)

And I want to perform an exhaustive best subset selection with the variables a-g as predictors and h as the response. I've tried the following:

library(bestglm)

bestglm(df,family=binomial,IC="AIC",method="exhaustive")

And the result is as follows:

AIC
BICq equivalent for q in (0.370228512040497, 0.759746926627138)
Best Model:
              Estimate Std. Error       z value  Pr(>|z|)
(Intercept) 1015.83048  1975540.4  0.0005142038 0.9995897
c            158.60524   243304.6  0.0006518794 0.9994799
d             77.54088   125534.0  0.0006176885 0.9995072
e           -279.63330   456645.4 -0.0006123642 0.9995114
f           -152.81443   246860.1 -0.0006190326 0.9995061

There were 34 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: algorithm did not converge
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
4: glm.fit: fitted probabilities numerically 0 or 1 occurred
5: glm.fit: algorithm did not converge
6: glm.fit: fitted probabilities numerically 0 or 1 occurred
7: glm.fit: algorithm did not converge
8: glm.fit: fitted probabilities numerically 0 or 1 occurred
9: glm.fit: algorithm did not converge
10: glm.fit: fitted probabilities numerically 0 or 1 occurred
11: glm.fit: fitted probabilities numerically 0 or 1 occurred
12: glm.fit: fitted probabilities numerically 0 or 1 occurred
13: glm.fit: algorithm did not converge
14: glm.fit: fitted probabilities numerically 0 or 1 occurred
15: glm.fit: algorithm did not converge
16: glm.fit: fitted probabilities numerically 0 or 1 occurred
17: glm.fit: algorithm did not converge
18: glm.fit: fitted probabilities numerically 0 or 1 occurred
19: glm.fit: algorithm did not converge
20: glm.fit: fitted probabilities numerically 0 or 1 occurred
21: glm.fit: fitted probabilities numerically 0 or 1 occurred
22: glm.fit: fitted probabilities numerically 0 or 1 occurred
23: glm.fit: algorithm did not converge
24: glm.fit: fitted probabilities numerically 0 or 1 occurred
25: glm.fit: algorithm did not converge
26: glm.fit: fitted probabilities numerically 0 or 1 occurred
27: glm.fit: algorithm did not converge
28: glm.fit: fitted probabilities numerically 0 or 1 occurred
29: glm.fit: algorithm did not converge
30: glm.fit: fitted probabilities numerically 0 or 1 occurred
31: glm.fit: algorithm did not converge
32: glm.fit: fitted probabilities numerically 0 or 1 occurred
33: glm.fit: fitted probabilities numerically 0 or 1 occurred
34: glm.fit: fitted probabilities numerically 0 or 1 occurred

I get all these warnings and I'm not sure what they mean. Furthermore, I'm not sure whether the model automatically treated the h variable as the target. I've checked the bestglm CRAN manual, but I'm still not sure what is causing the warnings or whether the package automatically selects the target variable.

Could I get some clarification on this please?

Thank you

CodePudding user response:

There are a couple of questions here. The first is "does bestglm recognise h as the dependent variable?". The answer to this is "yes", but only because h is the last column in your data frame. You can see in the source code that the dependent variable is found using the line:

y <- Xy[, ncol(Xy)]

where Xy is the input data frame. The response is then checked for being binary by the line:

LastColumnBinaryQ <- length(unique(y)) == 2

So, you can be satisfied that h is being used as your dependent variable here.
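If you want a quick sanity check (a sketch, using the c/d/e/f subset reported in the output above; your selected variables may differ since the data were not seeded), refit that subset with a plain glm() call that names h as the response explicitly. The coefficients should reproduce bestglm's "Best Model" table:

# Refit the selected subset, naming h explicitly as the response.
# The coefficients should match the "Best Model" output above.
fit <- glm(h ~ c + d + e + f, family = binomial, data = df)
coef(fit)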

The second question is "why am I getting all these warnings?". The answer is simply that some of the models being tested cannot be fitted to the data. With only 10 observations and up to 7 candidate predictors, some subsets will separate the 0s from the 1s perfectly: the fitted probabilities are driven to exactly 0 or 1, the coefficient estimates grow without bound (hence the enormous estimates and standard errors in your output), and glm.fit stops before it converges.
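As a minimal illustration (toy data, not from the question), a single predictor that perfectly splits the 0s from the 1s is enough to trigger the same warnings:

x <- 1:6
y <- c(0, 0, 0, 1, 1, 1)       # y is 1 exactly when x > 3: complete separation
glm(y ~ x, family = binomial)
# glm.fit warns that fitted probabilities numerically 0 or 1 occurred,
# because the likelihood keeps improving as the slope estimate grows
# without bound; no finite maximum-likelihood estimate exists.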

If you increase the sample size to an adequate level, you won't get any of these warnings. For example, if we make your data reproducible using set.seed, then a sample size of 30 is adequate to avoid any warnings:

library(bestglm)

set.seed(1)

a <- rnorm(30, 5, 1)
b <- rnorm(30, 5, 1)
c <- rnorm(30, 5, 1)
d <- rnorm(30, 5, 1)
e <- rnorm(30, 5, 1)
f <- rnorm(30, 5, 1)
g <- rnorm(30, 5, 1)
h <- sample(c(0, 1), replace = TRUE, size = 30) 

df = data.frame(a, b, c, d, e, f, g, h)

bestglm(df, family = binomial, IC = "AIC", method = "exhaustive")
#> Morgan-Tatar search since family is non-gaussian.
#> AIC
#> BICq equivalent for q in (0.64142110949793, 0.765911128653973)
#> Best Model:
#>               Estimate Std. Error   z value   Pr(>|z|)
#> (Intercept) -8.4029989  3.9727179 -2.115176 0.03441492
#> c            0.6599014  0.4731627  1.394661 0.16311820
#> d            1.0114990  0.5297931  1.909234 0.05623190

Using this method with a large number of candidate predictors and only 10 data points is always going to be problematic. The warnings don't mean you are getting the wrong result from your data, but they should give you pause about whether you are over-fitting your model to your data.
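If over-fitting is a concern, it also helps to look beyond the single printed model. As a sketch, assuming the components documented for the bestglm return value (BestModel is an ordinary glm object, and BestModels tabulates the top-ranked subsets):

res <- bestglm(df, family = binomial, IC = "AIC", method = "exhaustive")
summary(res$BestModel)   # standard glm summary of the selected model
res$BestModels           # the handful of best subsets found during the search

If several subsets have criterion values within a point or two of each other, the data are not strongly distinguishing between models at this sample size.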

Created on 2022-11-18 with reprex v2.0.2
