I am trying to create a function to run a k-fold cross validation on a glmer object. This is just data I got online (my dataset is quite large) so the model isn't the best but if I can get this to work using this data I should be able to switch it to my dataset quite easily.
I want to do a LOOCV(Leave One Out Cross-Validation)
"LOOCV(Leave One Out Cross-Validation) is a type of cross-validation approach in which each observation is considered as the validation set and the rest (N-1) observations are considered as the training set."
The outline I got was from Caroline's answer on this researchgate thread. https://www.researchgate.net/post/Does_R_code_for_k-fold_cross_validation_of_a_nested_glmer_model_exist
#load libraries
library(tidyverse)
library(optimx)
library(lme4)
#add example data
Data <- read.csv("https://stats.idre.ucla.edu/stat/data/hdp.csv")
Data <- select(Data, remission, IL6, CRP, DID)
Data
Data$remission<- as.factor(Data$remission)
Data$DID<- as.factor(Data$DID)
#add ROW column
Data <- Data %>% mutate(ROW = row_number())
head(Data)
PTOT=NULL
for (i in 1:8825) { # i in total number of observations in dataset
##Data that will be predicted
DataC1=Data[unique(Data$ROW)==i,]
###To train the model
DataCV=Data[unique(DataC1$ROW)!=i,]
M1 <- glmer(remission ~ 1 IL6 CRP ( 1 | DID ), data = DataCV, family = binomial, control = glmerControl(optimizer ='optimx', optCtrl=list(method='L-BFGS-B')))
P1=predict(M1, DataC1)
names(P1)=NULL
P1
PTOT= c(PTOT, P1)
}
R2cv=1-(sum((remission-PTOT)^2)/(length(PTOT))/(var(remission)))
This is the error I get "Error: Invalid grouping factor specification, DID"
CodePudding user response:
DataCV
is empty.
For example:
i <- 1 ## first time through the loop
DataCV=Data[unique(DataC1$ROW)!=i,]
I think that should have been DataC$ROW)
, not DataC1$ROW
.
A few other comments: a more compact version of your code would look something like this:
## fit the full model
M1 <- glmer(remission ~ 1 IL6 CRP ( 1 | DID ), data = DataC,
family = binomial, control = glmerControl(optimizer ='optimx', optCtrl=list(method='L-BFGS-B')))
res <- numeric(nrow(DataCV))
for (i in 1:nrow(DataCV)) {
new_fit <- update(M1, data = dataC[-i,]
res[i] <- (predict(new_fit, newdata=dataC[i,]) - remission[i])^2
}
For a well-specified model LOOCV is asymptotically equivalent to AIC, so you might be doing a lot of work to get something that's not very different from the AIC (which you can get directly from a single model fit) ...