I want to split the following numpy arrays for training and testing: X, y and qid.
X is a set of featurized documents - shape: (140, 105)
qid is a set of query identifiers for each document - shape: (140,)
y is a set of labels for each (X, qid) pair - shape: (140,)
At the moment, what I do for splitting is:
from sklearn.model_selection import train_test_split
# Split documents, labels, and query_ids into training (70%) and testing (30%)
X_tr, X_tst, y_tr, y_tst, qid_tr, qid_tst = train_test_split(X, y, qid, test_size=0.3, random_state=1, shuffle=True, stratify=qid)
The problem is that after splitting, I need the returned numpy arrays to be sorted by qid. That is, all the documents with the same qid need to be together (one after another) as a block (both in training and testing).
Example
Correct split:
X            qid   y
------------------------------
document 1   0     0
document 5   0     1
document 4   1     1
document 6   1     0
document 9   2     1
Incorrect split:
X            qid   y
------------------------------
document 1   0     0
document 4   1     1
document 9   2     1
document 5   0     1
document 6   1     0
Is there any way to make this possible?
Answer:
There is a simple way to split data into a training and a test set. While splitting, you want to maintain two things:
- Your data is shuffled properly. A data set usually arrives in some order, and shuffling it before the split gives more representative train and test sets.
- You get the same set of rows in the train and test split each time, which requires fixing the random seed.
For that, you can build a single pandas DataFrame df that holds X, qid and y as columns (one row per document), and then use pandas to shuffle it and split it into a train and a test set.
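import pandas as pd
# df is assumed to already hold the X features plus the qid and y columns
# Shuffle your dataset (a fixed random_state makes the split reproducible)
shuffle_df = df.sample(frac=1, random_state=1)
# Define a size for your train set (70% of the rows)
train_size = int(0.7 * len(df))
# Split your dataset
train_set = shuffle_df[:train_size]
test_set = shuffle_df[train_size:]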
Now you can sort the training set by its qid column, so that all documents with the same qid form one contiguous block, and then split it back into separate arrays to obtain X_train, y_train and qid_train. Do the same thing for the test set.
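The whole procedure can be sketched end to end as follows. This is only a minimal illustration under stated assumptions: the arrays are synthetic stand-ins with the shapes from the question, the feature column names (f0, f1, ...) are made up, and unpack_sorted is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins with the shapes from the question (assumed data).
rng = np.random.default_rng(0)
X = rng.normal(size=(140, 105))
qid = np.repeat(np.arange(14), 10)   # 14 queries, 10 documents each
y = rng.integers(0, 2, size=140)

# One row per document: feature columns plus qid and y.
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
df["qid"] = qid
df["y"] = y

# Shuffle reproducibly, then take the first 70% of the rows for training.
shuffled = df.sample(frac=1, random_state=1)
train_size = int(0.7 * len(shuffled))
train_set = shuffled.iloc[:train_size]
test_set = shuffled.iloc[train_size:]

def unpack_sorted(part):
    """Hypothetical helper: sort one split by qid and return X, y, qid arrays."""
    # mergesort is stable, so rows with equal qid keep their shuffled order.
    part = part.sort_values("qid", kind="mergesort")
    features = part.drop(columns=["qid", "y"]).to_numpy()
    return features, part["y"].to_numpy(), part["qid"].to_numpy()

X_tr, y_tr, qid_tr = unpack_sorted(train_set)
X_tst, y_tst, qid_tst = unpack_sorted(test_set)
```

Note that this plain shuffle does not stratify by qid the way the train_test_split call in the question does. If you prefer to keep that call, the same idea applies to its output: compute order = np.argsort(qid_tr, kind="stable") and index X_tr, y_tr and qid_tr with order (and likewise for the test arrays).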