Error in generalized linear mixed model cross-validation: The value in 'data[[cat

I am trying to conduct a 5-fold cross validation on a generalized linear mixed model using the groupdata2 and cvms packages. This is the code I tried to run:

data <- groupdata2::fold(detect, k = 5,
                            cat_col = 'outcome',
                            id_col = 'bird') %>% 
                            arrange(.folds)

cvms::cross_validate(
  data,
  "outcome ~ sex + year + season + (1 | bird) + (1 | obsname)",
  family = "binomial",
  fold_cols = ".folds",
  control = NULL,
  REML = FALSE)

This is the error I receive:

Error in groupdata2::fold(detect, k = 4, cat_col = "outcome", id_col = "bird") %>%  : 
  1 assertions failed:
 * The value in 'data[[cat_col]]' must be constant within each ID.

In the package vignette, the following explanation is given: "A participant must always have the same diagnosis (‘a’ or ‘b’) throughout the dataset. Otherwise, the participant might be placed in multiple folds." This makes sense in the example. However, my data is based on the outcome of resighting birds, so outcome varies depending on whether the bird was observed on that particular survey. Is there a way around this?

Reproducible example:

library(dplyr)  # provides %>% and arrange()

bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1)
df <- data.frame(bird, outcome)
df$outcome <- as.factor(df$outcome)
df$bird <- as.factor(df$bird)

data <- groupdata2::fold(df, k = 5,
                         cat_col = 'outcome',
                         id_col = 'bird') %>%
  arrange(.folds)

Answer:

The full documentation says:

cat_col: Name of categorical variable to balance between folds. E.g. when predicting a binary variable (a or b), we usually want both classes represented in every fold. N.B. If also passing an ‘id_col’, ‘cat_col’ should be constant within each ID.

So in this case, where outcome varies within individual birds (id_col), you simply can't specify that the folds be balanced with respect to the outcome. (I don't 100% understand this constraint in the software: it seems it should be possible to do at least approximate balancing by selecting groups (birds) with a balanced range of outcomes, but I can see how it could make the balancing procedure a lot harder.)
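A minimal sketch of what that looks like with the reproducible example from the question: drop cat_col and fold on id_col only, so that whole birds are assigned to folds. Note that k = 3 here is my choice, because the toy data has only three birds; with the real data you would keep whatever k the number of birds supports.

```r
library(groupdata2)

# Toy data from the question
bird <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
outcome <- c(0,1,1,1,0,0,0,1,0,1,0,1,0,0,1)
df <- data.frame(bird = factor(bird), outcome = factor(outcome))

set.seed(1)  # fold assignment is random

# No cat_col: birds are kept together, outcome balance is not enforced
folded <- fold(df, k = 3, id_col = "bird")

# Each bird (row) should have observations in exactly one fold (column)
fold_table <- table(folded$bird, folded$.folds)
print(fold_table)
```

The resulting data frame still has a .folds column, so it can be passed to cvms::cross_validate() with fold_cols = ".folds" as before.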

In my opinion, though, the importance of balancing outcomes is somewhat overrated in general. Lack of balance would mean that some of the simpler metrics in ?binomial_metrics (e.g. accuracy, sensitivity, specificity) are not very useful, but others (balanced accuracy, AUC, aic) should be fine.
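To illustrate why plain accuracy is the fragile one, here is a small base-R sketch with made-up numbers (not the bird data): a fold with 90 negatives and 10 positives, and a useless classifier that always predicts 0.

```r
# Hypothetical, deliberately unbalanced fold: 90 zeros, 10 ones
truth <- factor(c(rep(0, 90), rep(1, 10)), levels = c(0, 1))
# A classifier that ignores the data and always predicts 0
pred <- factor(rep(0, 100), levels = c(0, 1))

tab <- table(truth = truth, pred = pred)

accuracy    <- sum(diag(tab)) / sum(tab)          # share of correct predictions
sensitivity <- tab["1", "1"] / sum(tab["1", ])    # true positive rate
specificity <- tab["0", "0"] / sum(tab["0", ])    # true negative rate
balanced_accuracy <- (sensitivity + specificity) / 2

print(c(accuracy = accuracy, balanced_accuracy = balanced_accuracy))
```

Accuracy comes out at 0.9 even though the classifier has learned nothing, while balanced accuracy is 0.5, which is what you would expect from guessing.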

A potentially greater problem is that you appear to have (potentially) crossed random effects (i.e. (1|bird) + (1|obsname)). I'm guessing obsname is the name of an observer: if some observers detected (or failed to detect) multiple birds and some birds were detected/failed by multiple observers, then there may be no way to define folds that are actually independent, or at least it may be very difficult.
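One quick way to see whether the two grouping factors are crossed is to cross-tabulate them. The sketch below uses a small invented data frame (the observer names "A", "B", "C" are hypothetical, and I am assuming obsname identifies the observer); with the real data you would substitute detect for toy.

```r
# Hypothetical survey records: which observer recorded which bird
toy <- data.frame(
  bird    = factor(c(1, 1, 2, 2, 3, 3)),
  obsname = factor(c("A", "B", "A", "C", "B", "C"))
)

# Rows are birds, columns are observers, cells are counts of records
tab <- xtabs(~ bird + obsname, data = toy)
print(tab)

observers_per_bird <- rowSums(tab > 0)  # how many observers saw each bird
birds_per_observer <- colSums(tab > 0)  # how many birds each observer saw

# Crossed (rather than nested) if birds share observers and vice versa
crossed <- any(observers_per_bird > 1) && any(birds_per_observer > 1)
print(crossed)
```

If crossed is TRUE, any split by bird will still leave the same observers on both sides of the train/test divide, which is the independence problem described above.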