Why are most of my residuals negative after logistic regression?

I have a dataset in which the outcome variable splits patients into two groups (those who had a side effect and those who did not). The groups are quite unbalanced: 128 patients had the side effect and 1614 did not. The results are as follows:

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3209  -0.4147  -0.3217  -0.2455   2.7529  

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -7.888112   0.859847  -9.174  < 2e-16 ***
age                    0.028529   0.009212   3.097  0.00196 ** 
bmi                    0.095759   0.015265   6.273 3.53e-10 ***
surgery_axilla_type11  0.923723   0.524588   1.761  0.07826 .  
surgery_axilla_type21  1.607389   0.600113   2.678  0.00740 ** 
surgery_axilla_type31  1.544822   0.573972   2.691  0.00711 ** 
cvd1                   0.624692   0.290005   2.154  0.03123 *  
rt_axilla1            -0.816374   0.353953  -2.306  0.02109 *

The output looks fine and the results seem to make sense, i.e. higher age, higher BMI, surgery to the armpit and cardiovascular disease all increase side effects. But when I inspect my residuals, most of them are negative, so I am assuming my model is under-predicting. For linear regression, the plot function in R lets me check whether the residuals and the QQ-plot look good, but I am not sure which diagnostic plots to use for logistic regression, or how to move beyond reading the coefficient table. Here is the histogram of my residuals: [image not shown]

Answer:

We don't have your data, so let's make a toy example that shows similar results.

Suppose there is a gene with two alleles called "A" and "B". Allele "A" carries a 20% chance of developing a certain disease, and allele "B" carries a 5% chance of developing the same disease. We take a population of 200 people, half of whom have allele "A" and half of whom have allele "B":

set.seed(1)

df <- data.frame(gene    = rep(c("A", "B"), each = 100),
                 disease = c(rbinom(100, 1, 0.2), rbinom(100, 1, 0.05)))

Next, we run a logistic regression to see if there is indeed a difference in the rates of disease between the two alleles:

model <- glm(disease ~ gene, data = df, family = binomial)

summary(model)
#>  
#>  Call:
#>  glm(formula = disease ~ gene, family = binomial, data = df)
#>  
#>  Deviance Residuals: 
#>      Min       1Q   Median       3Q      Max  
#>  -0.6105  -0.6105  -0.2857  -0.2857   2.5373  
#>  
#>  Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#>  (Intercept)  -1.5856     0.2662  -5.956 2.58e-09 ***
#>  geneB        -1.5924     0.5756  -2.767  0.00566 ** 
#>  ---
#>  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>  
#>  (Dispersion parameter for binomial family taken to be 1)
#>  
#>      Null deviance: 134.37  on 199  degrees of freedom
#>  Residual deviance: 124.77  on 198  degrees of freedom
#>  AIC: 128.77
#>  
#>  Number of Fisher Scoring iterations: 6

All good so far.

If we predict using the model, we get the expected probability of each person having the disease based on their genotype:

predictions <- predict(model, type = "response")

predictions
#>     1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18 
#>  0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 
#>    19   20   21   22   23   24   25   26   27   28   29   30   31   32   33   34   35   36 
#>  0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 
#>    37   38   39   40   41   42   43   44   45   46   47   48   49   50   51   52   53   54 
#>  0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 
#>    55   56   57   58   59   60   61   62   63   64   65   66   67   68   69   70   71   72 
#>  0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 
#>    73   74   75   76   77   78   79   80   81   82   83   84   85   86   87   88   89   90 
#>  0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 
#>    91   92   93   94   95   96   97   98   99  100  101  102  103  104  105  106  107  108 
#>  0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 
#>   109  110  111  112  113  114  115  116  117  118  119  120  121  122  123  124  125  126 
#>  0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 
#>   127  128  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143  144 
#>  0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 
#>   145  146  147  148  149  150  151  152  153  154  155  156  157  158  159  160  161  162 
#>  0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 
#>   163  164  165  166  167  168  169  170  171  172  173  174  175  176  177  178  179  180 
#>  0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 
#>   181  182  183  184  185  186  187  188  189  190  191  192  193  194  195  196  197  198 
#>  0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 
#>   199  200 
#>  0.04 0.04

We see a predicted probability of 0.17 of having the disease for those with allele "A", and a predicted probability of 0.04 for those with allele "B".
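
As a quick check, a model with a single two-level predictor has one parameter per group, so its fitted probabilities should match the observed disease rate in each group. Here is a minimal sketch that rebuilds the toy data so it runs on its own (the names `observed` and `fitted_probs` are just for illustration):

```r
# Rebuild the toy data and model from above
set.seed(1)
df <- data.frame(gene    = rep(c("A", "B"), each = 100),
                 disease = c(rbinom(100, 1, 0.2), rbinom(100, 1, 0.05)))
model <- glm(disease ~ gene, data = df, family = binomial)

# Observed proportion of people with the disease in each group
observed <- tapply(df$disease, df$gene, mean)

# Fitted probabilities recovered from the coefficients via the inverse logit
b <- coef(model)
fitted_probs <- c(A = plogis(b[["(Intercept)"]]),
                  B = plogis(b[["(Intercept)"]] + b[["geneB"]]))

# The two should agree up to the convergence tolerance of the fit
all.equal(as.numeric(observed), as.numeric(fitted_probs), tolerance = 1e-4)
```

The coefficients are on the log-odds scale, which is why `plogis()` is needed to get back to probabilities.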

Note that because the disease has a prevalence of less than 50% in both groups, most predictions will be higher than the actual outcomes (since most outcomes are 0, and all predicted probabilities are greater than zero).

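In logistic regression, the coefficients are chosen by maximum likelihood, which is equivalent to minimizing the total deviance (the sum of the squared deviance residuals). Each deviance residual takes the sign of the outcome minus its predicted probability, so the people who don't have the disease will all have negative residuals. So in this example, as in your own data, you will get a large number of small negative residuals and a small number of large positive residuals. (With an intercept in the model, the raw response residuals, i.e. outcome minus predicted probability, add to zero, so the few large positive values balance the many small negative ones.)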
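
To make the sign argument concrete, here is a minimal sketch that computes the deviance residuals by hand for a 0/1 outcome and compares them with what R reports. It rebuilds the toy data so it is self-contained; `loglik_i` and `dev_manual` are illustrative names, not anything built into R:

```r
# Rebuild the toy data and model from above
set.seed(1)
df <- data.frame(gene    = rep(c("A", "B"), each = 100),
                 disease = c(rbinom(100, 1, 0.2), rbinom(100, 1, 0.05)))
model <- glm(disease ~ gene, data = df, family = binomial)

y <- df$disease
p <- fitted(model)

# Log-likelihood contribution of each observation: log(p) if y = 1, log(1 - p) if y = 0
loglik_i <- ifelse(y == 1, log(p), log(1 - p))

# Deviance residual: sign of (y - p) times sqrt(-2 * log-likelihood contribution)
dev_manual <- sign(y - p) * sqrt(-2 * loglik_i)

# Should match R's own deviance residuals
all.equal(unname(dev_manual), unname(residuals(model, type = "deviance")))

# The sign is decided entirely by the outcome: 0 gives negative, 1 gives positive
table(outcome = y, sign = sign(dev_manual))

# Response residuals (y - p) sum to approximately zero because the model has an intercept
sum(residuals(model, type = "response"))
```

Because every outcome of 0 is paired with a fitted probability above zero, the share of negative residuals is simply the share of zeros in the outcome.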

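# model$residuals holds the working residuals; ask for the deviance residuals shown in summary()
hist(residuals(model, type = "deviance"))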

Note that you would see the opposite pattern of a large number of small positive residuals and a small number of large negative residuals if most of the outcomes were 1 rather than zero.
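
A small sketch of that mirrored case, using the same setup but with hypothetical disease rates of 80% and 95% (these rates and the names `df_flip` and `model_flip` are assumptions chosen only to make most outcomes 1):

```r
# Same design as the toy example, but now most outcomes are 1
set.seed(1)
df_flip <- data.frame(gene    = rep(c("A", "B"), each = 100),
                      disease = c(rbinom(100, 1, 0.8), rbinom(100, 1, 0.95)))
model_flip <- glm(disease ~ gene, data = df_flip, family = binomial)

res_flip <- residuals(model_flip, type = "deviance")

# Five-number summary, comparable to the "Deviance Residuals" block in summary()
quantile(res_flip)

# Proportion of positive residuals equals the proportion of outcomes that are 1
mean(res_flip > 0)
mean(df_flip$disease == 1)
```

Here the median residual should come out positive, with the few outcomes of 0 producing the large negative values.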

In short, the distribution of residuals is expected given the distribution of your outcome data.
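
Since the raw residuals of a binary outcome mostly reflect the outcome itself, one common alternative for diagnostics is a binned residual plot: average the response residuals within groups of similar fitted probability and check that the averages scatter around zero. The sketch below uses simulated data with a hypothetical continuous predictor `x`, not your variables, and the choice of 10 bins is arbitrary:

```r
# Simulated example: rare binary outcome driven by one continuous predictor
set.seed(42)
n <- 2000
x <- rnorm(n)                                # hypothetical predictor
y <- rbinom(n, 1, plogis(-2.5 + 0.8 * x))    # most outcomes are 0
fit <- glm(y ~ x, family = binomial)

p   <- fitted(fit)
res <- y - p                                 # response residuals

# Split observations into 10 bins of roughly equal size by fitted probability
bins <- cut(p, breaks = quantile(p, probs = seq(0, 1, 0.1)),
            include.lowest = TRUE)

# Average fitted value and average residual within each bin
binned <- data.frame(mean_fitted   = as.vector(tapply(p, bins, mean)),
                     mean_residual = as.vector(tapply(res, bins, mean)),
                     n             = as.vector(table(bins)))
binned

# Averages far from zero in some bins would suggest the model is miscalibrated there
plot(binned$mean_fitted, binned$mean_residual,
     xlab = "Mean fitted probability", ylab = "Mean response residual")
abline(h = 0, lty = 2)
```

With your own model you would replace `fit` with your fitted `glm` object and `y` with your 0/1 outcome; individual residuals will still be mostly negative, but the bin averages are what tell you about under- or over-prediction.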