For example, I have 10 .txt files that I want to divide into test and train data
(test_rate = 0.2, which means 2 test files and 8 train files).
In that case, the whole KFold cross-validation should run 45 times (C(10, 2) = 45 combinations).
How can I do this in Python, using sklearn's KFold (code below) or another method? Thanks in advance.
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
KFold(n_splits=2, random_state=None, shuffle=False)
CodePudding user response:
Yes, you can use sklearn. If you want your test data to be 0.2 of the whole dataset, use 5-fold cross-validation: in 5-fold CV you divide your data into 5 splits and, each time, use 4 of them for training and the remaining 1 for testing. In general, n_splits is about 1/test_rate, so here n_splits should be 5.
fnames = np.array([
"1.txt",
"2.txt",
"3.txt",
"4.txt",
"5.txt",
"6.txt",
"7.txt",
"8.txt",
"9.txt",
"10.txt"
])
kfold = KFold(n_splits=5)
for i, (train_idx, test_idx) in enumerate(kfold.split(fnames)):
    print(f"Fold {i}")
    train_fold, test_fold = fnames[train_idx], fnames[test_idx]
    print(f"\tlen train fold: {len(train_fold)}")
    print(f"\tTrain fold: {train_fold}")
    print(f"\tlen test fold: {len(test_fold)}")
    print(f"\tTest fold: {test_fold}")
This prints
Fold 0
	len train fold: 8
	Train fold: ['3.txt' '4.txt' '5.txt' '6.txt' '7.txt' '8.txt' '9.txt' '10.txt']
	len test fold: 2
	Test fold: ['1.txt' '2.txt']
Fold 1
	len train fold: 8
	Train fold: ['1.txt' '2.txt' '5.txt' '6.txt' '7.txt' '8.txt' '9.txt' '10.txt']
	len test fold: 2
	Test fold: ['3.txt' '4.txt']
Fold 2
	len train fold: 8
	Train fold: ['1.txt' '2.txt' '3.txt' '4.txt' '7.txt' '8.txt' '9.txt' '10.txt']
	len test fold: 2
	Test fold: ['5.txt' '6.txt']
Fold 3
	len train fold: 8
	Train fold: ['1.txt' '2.txt' '3.txt' '4.txt' '5.txt' '6.txt' '9.txt' '10.txt']
	len test fold: 2
	Test fold: ['7.txt' '8.txt']
Fold 4
	len train fold: 8
	Train fold: ['1.txt' '2.txt' '3.txt' '4.txt' '5.txt' '6.txt' '7.txt' '8.txt']
	len test fold: 2
	Test fold: ['9.txt' '10.txt']
You may want to pass shuffle=True and set a random_state in KFold for reproducibility.
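If you actually need every one of the C(10, 2) = 45 train/test combinations mentioned in the question (rather than 5 disjoint folds), KFold won't give you that, since its test folds never overlap. sklearn's LeavePOut does exactly this. A minimal sketch:

```python
import numpy as np
from sklearn.model_selection import LeavePOut

fnames = np.array([f"{i}.txt" for i in range(1, 11)])

# LeavePOut(p=2) enumerates every possible 2-element test set:
# C(10, 2) = 45 splits, each with 8 train files and 2 test files.
lpo = LeavePOut(p=2)
print(lpo.get_n_splits(fnames))  # 45

for train_idx, test_idx in lpo.split(fnames):
    train_fold, test_fold = fnames[train_idx], fnames[test_idx]
    # ... train on train_fold, evaluate on test_fold here ...
```

Note that exhaustive leave-2-out gets expensive quickly as the dataset grows (C(n, 2) splits), which is why KFold is the usual choice for larger datasets.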