How to do KFold cross validation with TXT files

For example, I have 10 .txt files that I want to split into test and train data.

(test_rate = 0.2, which means 2 test files and 8 train files)

In that case, the whole KFold cross validation should run 45 times (C(10, 2) = 45).

How can I do this in Python, using sklearn's KFold function (code below) or some other method? Thanks a lot for any reply.

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
KFold(n_splits=2, random_state=None, shuffle=False)

CodePudding user response:

Yes, you can use sklearn. If you want the test data to be 0.2 of the whole dataset, use 5-fold cross validation: the data is divided into 5 folds, and on each iteration 4 of them are used for training and the remaining 1 for testing. So n_splits should be 5. Note that this gives 5 train/test splits in which each file is tested exactly once, not all 45 possible 2-file test sets.

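import numpy as np
from sklearn.model_selection import KFold

# The 10 file names to split; a NumPy array can be indexed with index arrays.
fnames = np.array([
  "1.txt",
  "2.txt",
  "3.txt",
  "4.txt",
  "5.txt",
  "6.txt",
  "7.txt",
  "8.txt",
  "9.txt",
  "10.txt"
])
# 5 folds over 10 files: each fold holds out 2 files (20%) for testing.
kfold = KFold(n_splits=5)

# split() yields integer index arrays, used here to select the file names.
for i, (train_idx, test_idx) in enumerate(kfold.split(fnames)):
  print(f"Fold {i}")
  train_fold, test_fold = fnames[train_idx], fnames[test_idx]
  print(f"\tlen train fold: {len(train_fold)}")
  print(f"\tTrain fold: {train_fold}")
  print(f"\tlen test fold: {len(test_fold)}")
  print(f"\tTest fold: {test_fold}")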

This prints

Fold 0
    len train fold: 8
    Train fold: ['3.txt' '4.txt' '5.txt' '6.txt' '7.txt' '8.txt' '9.txt' '10.txt']
    len test fold: 2
    Test fold: ['1.txt' '2.txt']
Fold 1
    len train fold: 8
    Train fold: ['1.txt' '2.txt' '5.txt' '6.txt' '7.txt' '8.txt' '9.txt' '10.txt']
    len test fold: 2
    Test fold: ['3.txt' '4.txt']
Fold 2
    len train fold: 8
    Train fold: ['1.txt' '2.txt' '3.txt' '4.txt' '7.txt' '8.txt' '9.txt' '10.txt']
    len test fold: 2
    Test fold: ['5.txt' '6.txt']
Fold 3
    len train fold: 8
    Train fold: ['1.txt' '2.txt' '3.txt' '4.txt' '5.txt' '6.txt' '9.txt' '10.txt']
    len test fold: 2
    Test fold: ['7.txt' '8.txt']
Fold 4
    len train fold: 8
    Train fold: ['1.txt' '2.txt' '3.txt' '4.txt' '5.txt' '6.txt' '7.txt' '8.txt']
    len test fold: 2
    Test fold: ['9.txt' '10.txt']
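
If you really do want all 45 runs mentioned in the question (every possible pair of files used once as the test set), KFold will not give you that, since it only produces 5 non-overlapping test folds. One option is sklearn's LeavePOut with p=2. The following is only a sketch, assuming the same ten file names as above; the `splits` list is just an illustrative way to collect the results.

```python
import numpy as np
from sklearn.model_selection import LeavePOut

# Same ten file names as in the KFold example.
fnames = np.array([f"{i}.txt" for i in range(1, 11)])

# Hold out every possible pair of files as the test set.
lpo = LeavePOut(p=2)
print(lpo.get_n_splits(fnames))  # number of splits: C(10, 2)

# Collect (train files, test files) for each split.
splits = [
    (fnames[train_idx], fnames[test_idx])
    for train_idx, test_idx in lpo.split(fnames)
]

for i, (train_fold, test_fold) in enumerate(splits[:3]):
    print(f"Split {i}: test = {test_fold}, train size = {len(train_fold)}")
```

Keep in mind that these test sets overlap heavily and the number of splits grows quickly with more files, so 5-fold CV is usually the cheaper choice.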

You may want to pass shuffle=True and a random_state to KFold: shuffling randomizes which files land in each fold, and a fixed random_state makes that shuffle reproducible.
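
A minimal sketch of the shuffled variant, again assuming the ten file names from above. The helper `shuffled_folds` and the seed value 42 are made up for illustration; any fixed integer works as a seed.

```python
import numpy as np
from sklearn.model_selection import KFold

fnames = np.array([f"{i}.txt" for i in range(1, 11)])

def shuffled_folds(names, seed):
    """Return a list of (train files, test files) for a shuffled 5-fold split."""
    kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [
        (names[train_idx].tolist(), names[test_idx].tolist())
        for train_idx, test_idx in kfold.split(names)
    ]

# Two runs with the same seed produce identical folds.
run_a = shuffled_folds(fnames, seed=42)
run_b = shuffled_folds(fnames, seed=42)
print(run_a == run_b)

for i, (train_fold, test_fold) in enumerate(run_a):
    print(f"Fold {i}: test = {test_fold}")
```

Each file still appears in exactly one test fold; only the assignment of files to folds changes compared with shuffle=False.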