Adding a column to a pandas DataFrame, randomly filled with values by percentage split

I want to do a train/valid/test split on a pandas DataFrame, but I do not want to generate new DataFrames. Instead, I want to add a new column called 'Split' whose values are 'train', 'valid', or 'test', assigned at random to 64%, 16%, and 20% of the rows, respectively.

I know of scikit-learn's train_test_split, but again, I don't want new frames. I could try:

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

but I just want a column 'Split' holding the labels train, valid, and test. This is for machine learning purposes, so I would like to make sure the split is completely random.

Does anyone know how this may be possible?

CodePudding user response:

Here's one way, using the suggested numpy.random.choice:

import pandas as pd
import numpy as np

# Set up a little example
data = np.ones(shape=(100, 3))
df = pd.DataFrame(data, columns=['x1', 'x2', 'y'])

# Split: draw one label per row, independently, with the given probabilities
split = ['train', 'valid', 'test']
df['split'] = np.random.choice(split, size=len(df), p=[0.64, 0.16, 0.20])

# Verify
df['split'].value_counts()

Because each row's label is drawn independently, the proportions only approximate 64/16/20. One run yielded:

train    64
valid    19
test     17
Name: split, dtype: int64
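
If the counts need to match the requested percentages exactly (up to rounding) rather than on average, one option is to build the right number of each label first and then shuffle them. This is a minimal sketch and not part of the answer above; the helper name assign_splits and the seed argument are made up for illustration, and it assumes the fractions sum to 1.

```python
import numpy as np
import pandas as pd


def assign_splits(n_rows, fractions, seed=None):
    """Return an array of n_rows labels in random order.

    fractions maps label -> fraction of rows (assumed to sum to 1).
    Hypothetical helper, written for illustration only.
    """
    labels = list(fractions)
    fracs = np.array([fractions[label] for label in labels], dtype=float)
    if not np.isclose(fracs.sum(), 1.0):
        raise ValueError("fractions must sum to 1")

    # Round the cumulative boundaries to whole rows so the counts always
    # add up to n_rows, even when n_rows * fraction is not an integer.
    bounds = np.rint(np.cumsum(fracs) * n_rows).astype(int)
    bounds[-1] = n_rows
    counts = np.diff(bounds, prepend=0)

    # Build the labels in order, then shuffle them.
    rng = np.random.default_rng(seed)
    return rng.permutation(np.repeat(labels, counts))


# Same toy frame as in the answer above
df = pd.DataFrame(np.ones(shape=(100, 3)), columns=['x1', 'x2', 'y'])
df['split'] = assign_splits(
    len(df), {'train': 0.64, 'valid': 0.16, 'test': 0.20}, seed=0
)
print(df['split'].value_counts())
```

With 100 rows this gives exactly 64 train, 16 valid, and 20 test rows; only which rows get which label is random. Passing a seed makes the assignment reproducible, which helps when the same split has to be recreated later.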