Suppose I have the following data:
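import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# labels: first five samples are 1, last five are 0
y = np.ones(10)
y[-5:] = 0
X = pd.DataFrame({'a': np.random.randint(10, 20, size=10),
                  'b': np.random.randint(80, 90, size=10)})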
X
    a   b
0  11  82
1  19  82
2  15  80
3  15  86
4  14  82
5  18  87
6  13  83
7  12  83
8  10  82
9  18  87
Splitting it to 5-fold gives the following indices:
kf = KFold()  # n_splits defaults to 5 (scikit-learn >= 0.22)
data = list(kf.split(X,y))
data
[(array([2, 3, 4, 5, 6, 7, 8, 9]), array([0, 1])),
(array([0, 1, 4, 5, 6, 7, 8, 9]), array([2, 3])),
(array([0, 1, 2, 3, 6, 7, 8, 9]), array([4, 5])),
(array([0, 1, 2, 3, 4, 5, 8, 9]), array([6, 7])),
(array([0, 1, 2, 3, 4, 5, 6, 7]), array([8, 9]))]
But I want to further prepare the data so that it is organised to contain the actual values, in the format:
data =
[(train1,trainlabel1,test1,testlabel1),
(train2,trainlabel2,test2,testlabel2),
..,
(train5,trainlabel5,test5,testlabel5)]
Expected Output (from the given MWE):
[array([
(array([[15,80],[15,86],[14,82],[18,87],[13,83],[12,83],[10,82],[18,87]]), array([[1],[1],[1],[0],[0],[0],[0],[0]])), #fold1 train/label
(array([[11,82],[19,82]]), array([[1],[1]])), #fold1 test/label
(array([[11,82],[19,82],[14,82],[18,87],[13,83],[12,83],[10,82],[18,87]]),array([[1],[1],[1],[0],[0],[0],[0],[0]])), #fold2 train/label
(array([[15,80],[15,86]]),array([[1],[1]])) #fold2 test/label
....
])]
CodePudding user response:
Actually, the answer from @hotuagia is correct. You're getting this error because you tried to access elements of y, which is a NumPy array, using loc, which is a pandas DataFrame/Series indexer. A handy fix is to convert y to a pandas DataFrame or Series before running the KFold loop.
So:
y = np.ones(10)
y[-5:] = 0
X = pd.DataFrame({'a': np.random.randint(10, 20, size=10),
                  'b': np.random.randint(80, 90, size=10)})
# convert the y array to a pandas DataFrame or Series so that .loc is available
y = pd.DataFrame(y)  # or pd.Series(y)
Then proceed with @hotuagia's answer:
for train_idx, test_idx in KFold(n_splits=2).split(X):
    # select the rows of each fold by index label
    x_train = X.loc[train_idx]
    x_test = X.loc[test_idx]
    y_train = y.loc[train_idx]
    y_test = y.loc[test_idx]
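If you would rather not convert y, a possible alternative is sketched below: a NumPy array can be indexed directly with the integer arrays that KFold yields, and .iloc selects DataFrame rows by position, so it should also work when X does not have the default 0..n-1 index. The seeded generator rng is an assumption added here only to make the example reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)  # assumption: fixed seed, only for reproducibility
y = np.ones(10)
y[-5:] = 0
X = pd.DataFrame({'a': rng.integers(10, 20, size=10),
                  'b': rng.integers(80, 90, size=10)})

for train_idx, test_idx in KFold(n_splits=2).split(X):
    # .iloc is positional, matching the positional indices KFold returns
    x_train, x_test = X.iloc[train_idx], X.iloc[test_idx]
    # y stays a plain NumPy array: integer-array indexing, no .loc needed
    y_train, y_test = y[train_idx], y[test_idx]
```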
CodePudding user response:
As you understand, KFold().split(data) returns the selected indices for each fold.
To select pandas.DataFrame rows from a list of indices, the easiest way is the loc indexer.
for train_idx, test_idx in KFold(n_splits=2).split(X):
    # select the rows of each fold by index label
    x_train = X.loc[train_idx]
    x_test = X.loc[test_idx]
    y_train = y.loc[train_idx]
    y_test = y.loc[test_idx]
You can then add your subset DataFrames to lists.
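A minimal sketch of that last step, assuming y has been converted to a pandas Series and that the goal is the (train, trainlabel, test, testlabel) layout from the question. The seeded generator rng and the use of to_numpy() with reshape(-1, 1) to get column-shaped label arrays are choices made here for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)  # assumption: fixed seed, only for reproducibility
y = pd.Series(np.r_[np.ones(5), np.zeros(5)])
X = pd.DataFrame({'a': rng.integers(10, 20, size=10),
                  'b': rng.integers(80, 90, size=10)})

data = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    # one tuple per fold: (train, trainlabel, test, testlabel) as NumPy arrays
    data.append((
        X.loc[train_idx].to_numpy(),
        y.loc[train_idx].to_numpy().reshape(-1, 1),
        X.loc[test_idx].to_numpy(),
        y.loc[test_idx].to_numpy().reshape(-1, 1),
    ))
```

Each element of data can then be unpacked as train, trainlabel, test, testlabel = data[i].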