Actual data from KFold split indices

Time:11-03

Suppose I have the following data:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

y = np.ones(10)
y[-5:] = 0
X = pd.DataFrame({'a':np.random.randint(10,20, size=(10)),
                  'b':np.random.randint(80,90, size=(10))})
X    
    a   b
0   11  82
1   19  82
2   15  80
3   15  86
4   14  82
5   18  87
6   13  83
7   12  83
8   10  82
9   18  87

Splitting it into 5 folds gives the following indices:

kf = KFold(n_splits=5)
data = list(kf.split(X,y))
data
[(array([2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1])),
 (array([0, 1, 4, 5, 6, 7, 8, 9]), array([2, 3])),
 (array([0, 1, 2, 3, 6, 7, 8, 9]), array([4, 5])),
 (array([0, 1, 2, 3, 4, 5, 8, 9]), array([6, 7])),
 (array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))]

But I want to go further and organise the data so that it contains the actual values, in the format:

data =
   [(train1,trainlabel1,test1,testlabel1),
    (train2,trainlabel2,test2,testlabel2),
     ..,
    (train5,trainlabel5,test5,testlabel5)]

Expected Output (from the given MWE):

[array([
        (array([[15,80],[15,86],[14,82],[18,87],[13,83],[12,83],[10,82],[18,87]]), array([[1],[1],[1],[0],[0],[0],[0],[0]])), #fold1 train/label
        (array([[11,82],[19,82]]), array([[1],[1]])),  #fold1 test/label

        (array([[11,82],[19,82],[14,82],[18,87],[13,83],[12,83],[10,82],[18,87]]),array([[1],[1],[1],[0],[0],[0],[0],[0]])), #fold2 train/label
        (array([[15,80],[15,86]]),array([[1],[1]])) #fold2 test/label

        ....
])]

CodePudding user response:

Actually, the answer from @hotuagia is correct. You're getting this error because you tried to index y, which is a NumPy array, with loc, which is a pandas indexer available only on DataFrame and Series objects. A convenient fix is to convert y to a pandas DataFrame or Series before running the split loop.

So:

import numpy as np
import pandas as pd

y = np.ones(10)
y[-5:] = 0
X = pd.DataFrame({'a':np.random.randint(10,20, size=(10)),
                  'b':np.random.randint(80,90, size=(10))})
# convert the y array to a pandas DataFrame or Series so that .loc is available
y = pd.DataFrame(y)  # or pd.Series(y)

Then proceed with @hotuagia's answer:

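from sklearn.model_selection import KFold

for train_idx, test_idx in KFold(n_splits=2).split(X):
    # select the feature rows for this fold
    x_train = X.loc[train_idx]
    x_test = X.loc[test_idx]

    # y is now a pandas object, so .loc works here as well
    y_train = y.loc[train_idx]
    y_test = y.loc[test_idx]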
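
To make the difference between the two kinds of y concrete, here is a minimal sketch. The index array idx is a hypothetical stand-in for one fold's indices, not something taken from the question. It shows that a plain NumPy array has no loc attribute but can be indexed directly, while the same values wrapped in a Series accept loc:

```python
import numpy as np
import pandas as pd

y = np.ones(10)
y[-5:] = 0

# Hypothetical positional indices, standing in for one fold returned by KFold.
idx = np.array([0, 1, 8])

# A NumPy array has no .loc attribute, but it supports positional indexing directly.
has_loc = hasattr(y, "loc")
y_from_array = y[idx]

# Wrapped in a Series, .loc works; with the default integer index,
# labels and positions coincide, so both selections agree.
y_series = pd.Series(y)
y_from_series = y_series.loc[idx].to_numpy()
```

So either approach works: keep y as an array and index it with y[idx], or convert it once and use loc throughout.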

CodePudding user response:

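As you understand, KFold().split(data) returns the selected indices for each fold. To select pandas.DataFrame rows from a list of indices, the easiest way is the loc indexer. Note that KFold yields positional indices, so loc matches them only when the DataFrame has the default integer index; iloc selects by position in every case.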

for train_idx, test_idx in KFold(n_splits=2).split(X):
    x_train = X.loc[train_idx]
    x_test = X.loc[test_idx]

    # y must be a pandas Series or DataFrame here; for a NumPy array use y[train_idx]
    y_train = y.loc[train_idx]
    y_test = y.loc[test_idx]

You can then append your subset DataFrames to a list.
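
As a rough sketch of that last step, assuming the X and y from the question (with y left as a NumPy array) and 5 folds, the loop below collects one (train, trainlabel, test, testlabel) tuple per fold. Using iloc, to_numpy() and reshape(-1, 1) is my choice to match the array layout in the question's expected output, not something the answers above prescribe:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

y = np.ones(10)
y[-5:] = 0
X = pd.DataFrame({'a': np.random.randint(10, 20, size=10),
                  'b': np.random.randint(80, 90, size=10)})

data = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    data.append((
        X.iloc[train_idx].to_numpy(),   # train features, shape (8, 2)
        y[train_idx].reshape(-1, 1),    # train labels as a column, shape (8, 1)
        X.iloc[test_idx].to_numpy(),    # test features, shape (2, 2)
        y[test_idx].reshape(-1, 1),     # test labels as a column, shape (2, 1)
    ))
```

Each entry can then be unpacked as train, trainlabel, test, testlabel = data[i]. If you prefer to keep DataFrames rather than arrays, drop the to_numpy() calls.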
