In the scenario of having three sets
- A train set of e.g. 80% (for model training)
- A validation set of e.g. 10% (for model validation)
- A test set of e.g. 10% (for final model testing)
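For concreteness, a quick sketch of how such an 80/10/10 split could be produced with scikit-learn's train_test_split (the seed and the two-step approach are just illustrative assumptions, not part of the question):

# A quick sketch, assuming scikit-learn is available; with only 10 points the
# 80/10/10 split comes out as 8 train, 1 validation and 1 test point.
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(1, 11)                                   # the points 1..10
train_val, test = train_test_split(data, test_size=0.1, random_state=0)
train, val = train_test_split(train_val, test_size=1/9, random_state=0)
print(train, val, test)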
let's say I perform k-fold cross validation (CV) on the example dataset [1,2,3,4,5,6,7,8,9,10]. Let's also say 10 is the test set in this example; the remaining points [1,2,3,4,5,6,7,8,9] will be used for training and validation.
Leave-one-out CV would then look something like this:
# Fold 1
[2, 3, 4, 5, 6, 7, 8, 9] # train
[1] # validation
# Fold 2
[1, 3, 4, 5, 6, 7, 8, 9] # train
[2] # validation
# Fold 3
[1, 2, 4, 5, 6, 7, 8, 9] # train
[3] # validation
# Fold 4
[1, 2, 3, 5, 6, 7, 8, 9] # train
[4] # validation
# Fold 5
[1, 2, 3, 4, 6, 7, 8, 9] # train
[5] # validation
# Fold 6
[1, 2, 3, 4, 5, 7, 8, 9] # train
[6] # validation
# Fold 7
[1, 2, 3, 4, 5, 6, 8, 9] # train
[7] # validation
# Fold 8
[1, 2, 3, 4, 5, 6, 7, 9] # train
[8] # validation
# Fold 9
[1, 2, 3, 4, 5, 6, 7, 8] # train
[9] # validation
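For reference, a rough sketch that reproduces these folds with scikit-learn's LeaveOneOut (assuming the data points are simply the integers 1 to 10, with 10 held out as the test set):

# A minimal sketch, assuming scikit-learn; it prints the nine folds listed
# above, with 10 held out as the test set throughout.
import numpy as np
from sklearn.model_selection import LeaveOneOut

data = np.arange(1, 11)                  # [1, ..., 10]
train_val, test = data[:-1], data[-1:]   # 1..9 for CV, 10 as the test set

loo = LeaveOneOut()
for fold, (train_idx, val_idx) in enumerate(loo.split(train_val), start=1):
    print(f"# Fold {fold}")
    print(train_val[train_idx].tolist(), "# train")
    print(train_val[val_idx].tolist(), "# validation")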
Great, now the model has been built and validated using each data point of the combined train and validation set once.
Next, I would test my model on the test set (10) and get some performance estimate.
What I was wondering now is: why do we not also perform CV over the test set and average the results, to see the impact of different test sets? In other words, why don't we repeat the above process 10 times so that each data point also appears once in the test set?
It would obviously be computationally very expensive, but I was thinking about it because it seems difficult to choose an appropriate test set. For example, my model from above might have performed quite differently if I had chosen 1 as the test set and trained and validated on the remaining points.
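For what it's worth, the procedure described here, rotating which point serves as the test set while running an inner validation loop on the rest, is essentially nested cross-validation. A toy sketch, where the model, hyperparameter grid and scoring are placeholder assumptions:

# Illustrative only: outer loop rotates the test point, inner loop does
# leave-one-out model selection on the remaining 9 points.
import numpy as np
from sklearn.model_selection import LeaveOneOut, GridSearchCV
from sklearn.linear_model import Ridge

X = np.arange(1, 11, dtype=float).reshape(-1, 1)   # toy features 1..10
y = 2 * X.ravel() + 1                              # toy targets

outer = LeaveOneOut()                              # rotate the test point
test_scores = []
for train_val_idx, test_idx in outer.split(X):
    # Inner leave-one-out CV selects a hyperparameter on the 9 other points.
    inner = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]},
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
    inner.fit(X[train_val_idx], y[train_val_idx])
    # Evaluate the refit best model on the single held-out test point.
    test_scores.append(inner.score(X[test_idx], y[test_idx]))

print("average test score over all 10 test points:", np.mean(test_scores))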
I wondered about this in scenarios where I have groups in my data. For example, [1,2,3,4] comes from group A, [5,6,7,8] comes from group B and [9,10] comes from group C.
In this case, choosing 10 as the test set could give a much different result than choosing 1, right? Or am I missing something here?
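As a small illustration of the group scenario, scikit-learn's LeaveOneGroupOut lets each group take a turn as the held-out set, so a group is never split across train and test (the labels below are just the A/B/C example from above):

# A sketch of the grouped data; each group (A, B, C) is held out once.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.arange(1, 11).reshape(-1, 1)                       # points 1..10
groups = np.array(["A"] * 4 + ["B"] * 4 + ["C"] * 2)      # the example groups

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, groups=groups):
    print("held-out group:", groups[test_idx][0],
          "-> points:", X[test_idx].ravel().tolist())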
CodePudding user response:
All your train/validation/test splits should be randomly sampled and sufficiently large. Hence, if your data comes from different groups, you should end up with roughly the same distribution of groups across the train, validation and test pools. If your test performance varies substantially with the sampling seed, you're definitely doing something wrong.
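One common way to get roughly the same group distribution across the pools is to stratify the random split on the group label. A sketch with made-up data sizes (a 10-point dataset would be far too small for this to be meaningful):

# Sketch only: the data and group sizes are invented; stratifying on the group
# label keeps the A/B/C proportions roughly equal in each pool.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
groups = np.array(["A"] * 40 + ["B"] * 40 + ["C"] * 20)

X_pool, X_test, g_pool, g_test = train_test_split(
    X, groups, test_size=0.1, stratify=groups, random_state=0)
X_train, X_val, g_train, g_val = train_test_split(
    X_pool, g_pool, test_size=1 / 9, stratify=g_pool, random_state=0)

for name, g in [("train", g_train), ("validation", g_val), ("test", g_test)]:
    labels, counts = np.unique(g, return_counts=True)
    print(name, dict(zip(labels.tolist(), counts.tolist())))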
As to why you should not use the test set for cross-validation: this would result in overfitting. Usually you run cross-validation many times with different hyperparameters and use the CV score to select the best model. If you don't have a separate test set on which to evaluate the model at the end of model selection, you will never know whether you overfitted to the training pool during the model-selection iterations.
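A minimal sketch of that workflow, cross-validation on the training pool for hyperparameter selection followed by a single evaluation on the untouched test set (the dataset, model and grid are placeholder choices):

# Placeholder model and grid; the point is only the split of responsibilities:
# CV score for selection, one final evaluation on the held-back test set.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Model selection: cross-validation on the training pool only.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_pool, y_pool)

# Final, one-shot estimate on data never touched during selection.
print("best alpha:", search.best_params_["alpha"])
print("test R^2:", search.score(X_test, y_test))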