Sorting train_test_split data by numpy array

I want to split the following numpy arrays for training and testing: X, y and qid

X is a set of featurized documents - shape: (140, 105)
qid is a set of query identifiers for each document - shape: (140,)
y is a set of labels for each (X, qid) pair - shape: (140,)

At the moment, what I do for splitting is:

from sklearn.model_selection import train_test_split

# Split documents, labels, and query_ids into training (70%) and testing (30%)
X_tr, X_tst, y_tr, y_tst, qid_tr, qid_tst = train_test_split(X, y, qid, test_size=0.3, random_state=1, shuffle=True, stratify=qid)

The problem is that after splitting, I need the returned numpy arrays to be sorted by qid. That is, all the documents with the same qid need to be together (one after another) as a block, in both the training and the testing arrays.

Example

Correct split:

X              qid           y       
------------------------------
document 1     0             0
document 5     0             1
document 4     1             1
document 6     1             0
document 9     2             1

Incorrect split:

X              qid           y       
------------------------------
document 1     0             0
document 4     1             1
document 9     2             1
document 5     0             1
document 6     1             0

Is there any way to make this possible?

CodePudding user response：

There is a simple way to split data into training and test sets. While splitting, you want to ensure two things:

Your data is shuffled properly. A data set usually arrives in some order, and shuffling it before splitting generally gives better results.
You get the same set of rows in the train and test splits each time, which requires a fixed random seed.

For that, you can build a single DataFrame (df) by joining X, qid and y, and then use pandas to shuffle it and split it into training and test sets.
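The joining step could look roughly like the sketch below. It assumes X, y and qid are NumPy arrays with the shapes given in the question; the values here are random placeholders, and the column names (f0, f1, ..., qid, y) are illustrative choices, not anything the question prescribes.

```python
import numpy as np
import pandas as pd

# Placeholder data with the shapes from the question (values are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(140, 105))
qid = np.repeat(np.arange(14), 10)   # 14 hypothetical queries, 10 documents each
y = rng.integers(0, 2, size=140)

# One row per document: feature columns first, then qid and y.
feature_cols = [f"f{i}" for i in range(X.shape[1])]  # hypothetical column names
df = pd.DataFrame(X, columns=feature_cols)
df["qid"] = qid
df["y"] = y
```

Keeping everything in one frame means a single shuffle moves each document's features, label and query id together.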

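import pandas as pd

# df is assumed to be the combined DataFrame (feature columns plus qid and y)

# Shuffle your dataset; a fixed random_state gives the same rows each time
shuffle_df = df.sample(frac=1, random_state=1)

# Define a size for your train set (70% of the rows)
train_size = int(0.7 * len(df))

# Split your dataset
train_set = shuffle_df[:train_size]
test_set = shuffle_df[train_size:]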

Now you can sort the training set by the qid column and split it back into separate pieces to obtain X_train, y_train and qid_train. Do the same for the test set.
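The sort-and-unpack step can be sketched as follows. This assumes a frame with feature columns plus columns named qid and y (hypothetical names), and unpack_sorted is an illustrative helper, not a library function. A stable sort (mergesort) is used so that rows within each qid keep their shuffled order.

```python
import numpy as np
import pandas as pd

def unpack_sorted(frame):
    """Sort a split by qid and return (X, y, qid) as NumPy arrays."""
    # mergesort is stable: rows sharing a qid keep their relative order
    ordered = frame.sort_values("qid", kind="mergesort")
    X_part = ordered.drop(columns=["qid", "y"]).to_numpy()
    return X_part, ordered["y"].to_numpy(), ordered["qid"].to_numpy()

# Tiny made-up frame standing in for a shuffled train_set
train_set = pd.DataFrame({
    "f0":  [0.1, 0.4, 0.9, 0.5, 0.6],
    "qid": [0, 1, 2, 0, 1],
    "y":   [0, 1, 1, 1, 0],
})

X_train, y_train, qid_train = unpack_sorted(train_set)
```

Calling the same helper on test_set gives the test arrays. If the arrays come straight from train_test_split instead, the same reordering can be done with order = np.argsort(qid_tr, kind="stable") and then indexing X_tr, y_tr and qid_tr with order.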