Creating function to run k-fold cross validation on glmer object (Leave One Out Cross-Validation)-CodePudding

I am trying to create a function to run a k-fold cross validation on a glmer object. This is just data I got online (my dataset is quite large) so the model isn't the best but if I can get this to work using this data I should be able to switch it to my dataset quite easily.

I want to do a LOOCV(Leave One Out Cross-Validation)

"LOOCV(Leave One Out Cross-Validation) is a type of cross-validation approach in which each observation is considered as the validation set and the rest (N-1) observations are considered as the training set."

The outline I got was from Caroline's answer on this researchgate thread. https://www.researchgate.net/post/Does_R_code_for_k-fold_cross_validation_of_a_nested_glmer_model_exist


#load libraries 
library(tidyverse)
library(optimx)
library(lme4)

#add example data 
Data <- read.csv("https://stats.idre.ucla.edu/stat/data/hdp.csv")
Data <- select(Data, remission, IL6, CRP, DID)
Data
Data$remission<- as.factor(Data$remission)
Data$DID<- as.factor(Data$DID)

#add ROW column  
Data <- Data %>% mutate(ROW = row_number())
head(Data)
PTOT=NULL
for (i in 1:8825) { # i in total number of observations in dataset 
  ##Data that will be predicted
  DataC1=Data[unique(Data$ROW)==i,]
  ###To train the model
  DataCV=Data[unique(DataC1$ROW)!=i,]
  M1 <- glmer(remission ~ 1   IL6   CRP   ( 1 | DID ), data = DataCV, family = binomial, control = glmerControl(optimizer ='optimx', optCtrl=list(method='L-BFGS-B')))
  P1=predict(M1, DataC1)
  names(P1)=NULL
  P1
  PTOT= c(PTOT, P1)
}

R2cv=1-(sum((remission-PTOT)^2)/(length(PTOT))/(var(remission)))

This is the error I get "Error: Invalid grouping factor specification, DID"

CodePudding user response：

DataCV is empty.

For example:

i <- 1  ## first time through the loop
DataCV=Data[unique(DataC1$ROW)!=i,]

I think that should have been DataC$ROW), not DataC1$ROW.

A few other comments: a more compact version of your code would look something like this:

## fit the full model
M1 <- glmer(remission ~ 1   IL6   CRP   ( 1 | DID ), data = DataC, 
   family = binomial, control = glmerControl(optimizer ='optimx', optCtrl=list(method='L-BFGS-B')))
res <- numeric(nrow(DataCV))
for (i in 1:nrow(DataCV)) {
   new_fit <- update(M1, data = dataC[-i,]
   res[i] <- (predict(new_fit, newdata=dataC[i,]) - remission[i])^2
}

For a well-specified model LOOCV is asymptotically equivalent to AIC, so you might be doing a lot of work to get something that's not very different from the AIC (which you can get directly from a single model fit) ...