Why is the quantile function not working for this dplyr function?-CodePudding

I'm working through Faraway's 2016 book Extending the Linear Model with R and have encountered an issue with the code that I don't know how to fix. Here is the relevant syntax leading up to the error:

#### Load Data & Libraries ####
library(faraway)
library(tidyverse)
data(wcgs)

#### Add Variables ####
wcgs$y <- ifelse(wcgs$chd == "no",0,1) # create binary response from chd 
wcgs$bmi <- with(wcgs,
                 703*wcgs$weight/(wcgs$height^2)) # create BMI variable

#### Create GLM Model ####
lmod <- glm(chd ~ height   cigs,
            family = binomial, 
            wcgs)

#### Mutate Data ####
wcgs <- mutate(wcgs,
               residuals=residuals(lmod), 
               linpred=predict(lmod)) # create residuals/pred values

And this is the part where the error arises (the third line which includes a mutate function:

#### Error Code (Last Line) ####
wcgsm <- na.omit(wcgs) # omit NA values
wcgsm <- mutate(wcgsm,
                predprob=predict(lmod,
                                 type="response")) # make pred data
gdf <- group_by(wcgsm, 
                cut(linpred,
                    breaks=unique(quantile(linpred, 
                                           (1:100)/101)))) # bin NA

Which gives me this error:

Error in `group_by()`:
! Problem adding computed columns.
Caused by error in `mutate()`:
! Problem while computing `..1 = cut(linpred, breaks = unique(quantile(linpred,
  (1:100)/101)))`.
✖ `..1` must be size 3140 or 1, not 3154.

I dont understand what this error means. When I run dim(wcgs), I get there are 3154 rows, and when I run dim(na.omit(wcgs)) I get 3140 rows. The only thing I can think of is that the predicted model values dont line up with the new na.omit data, but I'm not sure now how to work around that given the rest of this chapter uses this data manipulation.

CodePudding user response：

predict methods for R's modeling functions always predict from the original data set the models were fitted to. To have a new data set, in this case a subset of the data wcgs, argument newdata must be explicitly set.
The error in the predict line at the bottom is therefore expected behavior.

#### Load Data & Libraries ####
suppressPackageStartupMessages({
  #library(faraway)
  library(dplyr)
})
data(wcgs, package = "faraway")

#### Add Variables ####
wcgs$y <- as.integer(wcgs$chd == "yes") # create binary response from chd 
wcgs$bmi <- with(wcgs, 703*weight/(height^2)) # create BMI variable

#### Create GLM Model ####
lmod <- glm(chd ~ height   cigs, family = binomial, data = wcgs)

#### Mutate Data ####
# create residuals/pred values
wcgs <- mutate(wcgs,
               residuals = residuals(lmod), 
               linpred = predict(lmod)) 

wcgsm <- na.omit(wcgs) # omit NA values
wcgsm <- mutate(wcgsm,
                predprob = predict(lmod, type="response")) # make pred data
#> Error in `mutate()`:
#> ! Problem while computing `predprob = predict(lmod, type = "response")`.
#> ✖ `predprob` must be size 3140 or 1, not 3154.

^{Created on 2022-07-16 by the reprex package (v2.0.1)}

See where the error comes from.

predprob_all <- predict(lmod, type = "response")
predprob_na.omit <- predict(lmod, newdata = wcgsm, type = "response")

length(predprob_all)
#> [1] 3154
length(predprob_na.omit)
#> [1] 3140

^{Created on 2022-07-16 by the reprex package (v2.0.1)}

These lengths are the values in the error message, once again, as expected.

There is also the problem of the quantiles in cut(., breaks) not spanning the entire range of linpred. Values outside the quantiles' range will become NA. This is solved with the two endpoints of the breaks vector.
And I have given a name to the binned vector.
The following code works and, I believe, does what is needed.

wcgsm <- na.omit(wcgs) # omit NA values
wcgsm <- mutate(wcgsm,
                predprob = predict(lmod, newdata = wcgsm, type="response")) # make pred data

breaks <- c(-Inf, 
            unique(quantile(wcgsm$linpred, (1:100)/101)),
             Inf)
gdf <- group_by(wcgsm, 
                bins = cut(linpred, breaks = breaks)) # bin NA

anyNA(gdf$bins)
#> [1] FALSE

^{Created on 2022-07-16 by the reprex package (v2.0.1)}