fitted values from speedglm() look very different from fitted values with glm()


The fitted values returned from speedglm() look really different from those returned from glm(), and I don't know why. For example, if I run this:

glm <- glm(married ~ treat + age + educ + black + hisp + nodegr, data = lalonde, family = "binomial")
fitted_vals <- glm$fitted.values

I get broadly what I'd expect, which is one fitted value per observation, lying between 0 and 1 (the two possible values of married). E.g.


── Data Summary ────────────────────────
Name                       fitted_vals
Number of rows             445        
Number of columns          1          
Column type frequency:                
  numeric                  1          
Group variables            None       

── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate  mean     sd     p0   p25   p50   p75  p100 hist 
1 data                  0             1 0.169 0.0913 0.0378 0.105 0.147 0.205 0.627 ▇▅▁▁▁

However, if I run the same model using speedglm(), I get pretty different results:

speedglm <- speedglm(married ~ treat + age + educ + black + hisp + nodegr, data = lalonde, family = binomial(), fitted = TRUE)
fitted_vals <- speedglm$linear.predictors


── Data Summary ────────────────────────
Name                       fitted_vals
Number of rows             445        
Number of columns          1          
Column type frequency:                
  numeric                  1          
Group variables            None       

── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
1 data                  0             1 -1.71 0.606 -3.24 -2.14 -1.76 -1.35 0.521 ▂▇▇▂▁

Does anyone know what's going on here? According to the documentation, linear.predictors seems to be the analogue of glm's fitted.values. As far as I understand, it shouldn't be possible to get fitted values outside the range of the dependent variable, but that's clearly what's happening.

CodePudding user response:

"Linear predictors" are not the same as "fitted values", unless a GLM is fitted with an identity link. In general the linear predictor is eta = b0 + b1*x1 + b2*x2 + ..., while the fitted value is mu = linkinv(eta), where linkinv is the inverse link function (the logistic, or inverse-logit, function in this case).

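To make the relationship concrete, here is a minimal sketch using base R's glm() on simulated data. The data frame `dat` and the variables `x1`, `x2` and `y` are made up for illustration and stand in for the lalonde data from the question. It shows that applying the inverse link to the linear predictor reproduces the fitted values:

```r
# Simulated data (hypothetical; stands in for the lalonde data in the question)
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * dat$x1 - 0.5 * dat$x2))

fit <- glm(y ~ x1 + x2, data = dat, family = binomial())

eta <- fit$linear.predictors       # linear-predictor (log-odds) scale
mu  <- family(fit)$linkinv(eta)    # inverse link maps eta to the response scale

range(eta)                         # not restricted to [0, 1]
range(mu)                          # probabilities, strictly between 0 and 1

# The back-transformed linear predictor matches the fitted values,
# and for the logit link the inverse link is plogis()
all.equal(unname(mu), unname(fitted(fit)))
all.equal(unname(mu), unname(plogis(eta)))
```

Applied to the question, and assuming speedglm's linear.predictors component holds eta (as the summary output above suggests), plogis(speedglm$linear.predictors) should land on the same 0-to-1 scale as glm$fitted.values.
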
It's generally safer to use accessor methods, so that you don't have to worry about how each package defines its internal components:

## fitted values (data scale)
all.equal(fitted(glm), fitted(speedglm))  ## TRUE
## predicted values (linear-predictor scale)
all.equal(predict(glm), predict(speedglm))  ## TRUE
## predict(., type = "response") == fitted(.)
all.equal(predict(glm, type = "response"), fitted(speedglm)) ## TRUE
