R GLM Predict Error - factor has new levels

I'm doing a basic logistic regression using glm().

I split the data into train and test, built the model using glm, and then tried running predict() using the test data.

Here is the code:

data = read.csv('2022_data.csv')

data$A= as.factor(data$A)
data$B= as.factor(data$B)

# split train and test
df = sort(sample(nrow(data), nrow(data)*.8))
df_train = data[df,]
df_test = data[-df,]

# create model
model1 = glm(attrition ~ A + B + C + D + E, data = df_train, family = binomial)

predict1 = predict(model1, df_test1, type='response')

I encountered this error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor A has new levels

I understand that this error message means there is a value in column A of the test data that the model does not account for. But I checked the levels of column A in the training and testing data, and both have exactly the same values:

levels(as.factor(df_test1$A))
levels(as.factor(df_train$A))

Both return

[1] ""  "N" "Y"

I'm not sure what I'm missing here.

CodePudding user response：

The thing about factors is that all the levels are stored in the column's metadata, whether or not each level still occurs in the data after subsetting.

So you may have trained on rows that contain only two of the three levels, and the third level then shows up in the test data. (Without seeing the data and some basic descriptive statistics, I cannot be sure.)

However, you can run the following code to see what I mean:

x <- as.factor(c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"))
y <- x[1:2]

When you look at y, this is what you see:

 y
[1] A B
Levels: A B C
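
Applied to the question, this suggests that calling levels() on a subset is not a reliable check, because it reports the declared levels rather than the ones that actually occur. A minimal sketch, using a made-up vector rather than the asker's data, of how the two can be told apart:

```r
# Toy factor with the same three levels as in the question (values are made up)
x <- factor(c("", "N", "Y", "N", "Y", "N"))

# Subset that happens to contain no "" rows
y <- x[x != ""]

levels(y)               # declared levels: still includes ""
levels(droplevels(y))   # levels that actually occur in y
table(y)                # counts per level, with 0 for ""
```

Comparing levels(droplevels(...)) or table(...) for the training and test subsets shows which levels are really present in each.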

If you want every level to be reflected in the coefficients estimated from the training data, use a stratified sampling method so that each level appears in the training set.
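
One way to do such a stratified split in base R is sketched below. The data frame is a made-up stand-in for the asker's data, the seed and the 80% fraction are assumptions, and only column A is used for stratification:

```r
set.seed(42)  # assumption: fixed seed so the split is reproducible

# Toy data standing in for the question's data frame (values are made up)
data <- data.frame(
  A = factor(c(rep("", 5), rep("N", 45), rep("Y", 50))),
  attrition = rbinom(100, 1, 0.3)
)

# Group row indices by level of A, then draw 80% of the rows within each level
idx_by_level <- split(seq_len(nrow(data)), data$A)
train_idx <- sort(unlist(lapply(idx_by_level, function(i) {
  # index into i rather than calling sample(i, ...) so a single-row level is safe
  i[sample.int(length(i), size = ceiling(0.8 * length(i)))]
}), use.names = FALSE))

df_train <- data[train_idx, ]
df_test  <- data[-train_idx, ]

table(df_train$A)  # every level should have a nonzero count here
table(df_test$A)
```

Because ceiling() is used, every level that occurs in the data contributes at least one row to the training set, although a level with very few rows may then be missing from the test set.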

Before you go too far, I would check that there are enough observations of each level to be meaningful:

> table(x)
x
A B C 
4 4 4

If you only have a couple of observations for one level, you have bigger problems to consider.

CodePudding user response：

I'd try

library(forcats)
df_test1$A <- df_test1$A |> fct_drop(c(""))

Your error comes from model.frame.default. I wonder whether the "" level is absent from the rows used to fit the model but present in the test data. Alternatively, you might want to reassign the "" values to "Y" or "N".
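
To check that guess, the levels the fitted model knows about (stored in its xlevels component) can be compared with the values present in the test rows. The sketch below uses toy data, not the asker's file, and setting unseen values to NA is only one possible choice. The error message usually lists the offending levels after "new levels", so a blank list there would be consistent with "" being the new level.

```r
set.seed(1)  # assumption: fixed seed for the made-up response

# Toy training data: "" is a declared level of A but occurs in no training row
train <- data.frame(
  A = factor(rep(c("N", "Y"), 20), levels = c("", "N", "Y")),
  attrition = rbinom(40, 1, 0.5)
)
# Toy test data that does contain a "" row
test <- data.frame(A = factor(c("", "N", "Y"), levels = c("", "N", "Y")))

model <- glm(attrition ~ A, data = train, family = binomial)

# Levels used when fitting vs. values that occur in the test rows
model$xlevels$A
unseen <- setdiff(unique(as.character(test$A)), model$xlevels$A)
unseen

# One option: mark unseen values as missing, so predict() returns NA for those rows
test$A[test$A %in% unseen] <- NA
pred <- predict(model, test, type = "response")
pred
```

Whether NA, "Y", or "N" is the right replacement for "" depends on what the blank values mean in the data.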