I want to do a train/valid/test split on a pandas DataFrame, but I do not want to generate new data frames. Rather, I want to add a new column called 'Split' where Split = ['train','valid','test']. I want 'train', 'valid', and 'test' to be distributed randomly across 64%, 16%, and 20% of the rows, respectively.
I know about scikit-learn's train_test_split, but again, I don't want new frames. I could try:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
but I just want a column 'Split' with values of train, valid, and test as labels. This is for machine learning purposes so I would like to make sure the splits are completely random.
Does anyone know how this may be possible?
Answer:
Here's one way, using the suggested numpy.random.choice:
import pandas as pd
import numpy as np
# Set up a little example
data = np.ones(shape=(100, 3))
df = pd.DataFrame(data, columns=['x1', 'x2', 'y'])
# Split: draw one label per row with the requested probabilities
split = ['train', 'valid', 'test']
df['split'] = np.random.choice(split, size=len(df), p=[0.64, 0.16, 0.20])
# Verify
df['split'].value_counts()
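For one given run, this yielded the counts below. Each row's label is drawn independently, so the proportions are only approximately 64/16/20: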
train 64
valid 19
test 17
Name: split, dtype: int64
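If the counts need to match 64/16/20 exactly rather than on average, one option is to build the label array with the right number of each label and then shuffle it. The following is only a sketch: the helper name add_exact_split and its seed parameter are made up for illustration, and it assumes the fractions sum to 1.

```python
import numpy as np
import pandas as pd


def add_exact_split(df, fractions, seed=None):
    """Return a copy of df with a 'split' column whose label counts follow `fractions`.

    Hypothetical helper (not part of pandas or scikit-learn).
    Assumes the values in `fractions` sum to 1.
    """
    rng = np.random.default_rng(seed)
    n = len(df)
    names = list(fractions)

    # Turn cumulative fractions into row boundaries, e.g. [64, 80, 100] for n=100.
    bounds = np.round(np.cumsum([fractions[k] for k in names]) * n).astype(int)
    bounds[-1] = n  # make sure every row gets a label
    counts = np.diff(np.concatenate(([0], bounds)))

    # Repeat each label the required number of times, then shuffle the labels.
    labels = np.repeat(names, counts)
    out = df.copy()
    out['split'] = rng.permutation(labels)
    return out


df = pd.DataFrame(np.ones((100, 3)), columns=['x1', 'x2', 'y'])
df = add_exact_split(df, {'train': 0.64, 'valid': 0.16, 'test': 0.20}, seed=0)
print(df['split'].value_counts())
```

With 100 rows this gives exactly 64 'train', 16 'valid', and 20 'test' rows; only which rows receive which label depends on the seed. When the row count does not divide evenly, the rounding of the boundaries decides which label gets the extra row.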