How do I utilize imputed data, with categorical levels, in a prediction in R?-CodePudding

I'll illustrate my problem with the iris data set in R. My objective here is to create 5 imputed data sets, fit a regression to each imputed data set, then pool together the results of these regressions into one final model. This is the preferred order of operations for a proper execution of multiple imputation.

library(mice)

df <- iris
# Inject some missingness into the data:
df$Sepal.Width[c(20,40,70,121)] <- NA
df$Species[c(15,80,99,136)] <- NA
# Perform the standard steps of multiple imputation with MICE:
imputed_data <- mice(df, method = c(rep("pmm", 5)), m = 5, maxit = 5)
model <- with(imputed_data, lm(Sepal.Length ~ Sepal.Width   Species))
pooled_model <- pool(model)

This leaves me with this pooled_model object which I am hoping to use as a fitted model in the predict command. However, that does not work. When I run:

predict(pooled_model, newdata = iris)

I get this error:

Error in UseMethod("predict") : 
  no applicable method for 'predict' applied to an object of class "c('mipo', 'data.frame')"

Disregard the reasoning why I am using the original iris data set in my newly fitted model; I simply want to be able to fit this data, or a subset of it, onto the model I created with my imputation.

I specifically chose a data set with multiple levels of a categorical variable to highlight my problem. I thought about using some matrix multiplication with which I could do this manually, but the presence of a categorical variable makes that tough. In my actual data set, I have over a hundred variables, many of which have multiple categorical levels. I say this because I realize one possible solution would be to re-code my categorical variables into dummy variables, and then I can apply some matrix multiplication to get my answer. But that would be an EXTREME amount of work for me. If there's a way I can somehow get a model object I can use in the predict function, that would make my life 100x easier.

Any suggestions?

CodePudding user response：

You have two issues: 1) how to use stats::predict with pooled data and 2) what to do about your categorical variables.

Your first issue has already been documented on the mice Github page and it seems like there's been a desire to have a predict.mira function for a while. The author of the mice package posted some code on how to simulate a predict.mira-like function. Unfortunately, it only works with lm models, but it seems like that's okay considering your reprex. If you have a Github account, you can comment on that Github issue to demonstrate your interest in the predict.mira function.

Your question also has been posted on StackOverflow before; although the answer was never accepted, the SO user suggested this reading by Miles (2015).

For your second question, have you considered leaving out your current method argument when using mice()? As long as your variables have been classed as factors, then mice will default to the polyreg method for categorical variables and pmm for continuous variables. You can read more about the method argument here.

library(mice)
set.seed(123)

# make missing data
df <- iris
df$Sepal.Width[c(20,40,70,121)] <- NA
df$Species[c(15,80,99,136)] <- NA

# specify method
meth <- mice(df, maxit = 0, printFlag = FALSE)$meth
print(meth)

# this is how you would change your methods, if you wanted
# but pmm and polyreg are defaults
meth["Species"] <- "polr"
meth["Sepal.Width"] <- "midastouch"
print(meth)

# impute
imputed_data <- mice(df, 
                     m = 5,
                     maxit = 5, 
                     method = meth,  # new method
                     printFlag = FALSE)

# make model
model <- with(imputed_data, lm(Sepal.Length ~ Sepal.Width   Species))
summary(pool(model))

# obtain predictions Q and prediction variance U
predm <- lapply(getfit(model), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df

# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
               dimnames = list(NULL, c("fit", "se.fit", "df")))
for(i in 1:nrow(Q)) {
  pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom   1)
  pred[i, 1] <- pi[["qbar"]]
  pred[i, 2] <- sqrt(pi[["t"]])
  pred[i, 3] <- pi[["df"]]
}

head(pred)