I'm working with a huge dataset and I have to split it into two parts for training and testing. I know there is a dedicated function (sklearn.model_selection.train_test_split), but since the dataset is imbalanced I have to write my own function.
What I want to do is divide the dataset into two parts according to the label values 1 and 0, take a percentage of each (e.g. 60% of the 1s and 30% of the 0s) and save it to the train file. The remaining rows (40% and 70%) should be saved to the test file.
At the moment, I've done it this way:
def split_test_train(df, train_0, train_1, test_0, test_1, name='column_name'):
    dataframe_values_1 = df.loc[df[name] == 1]  # all the rows with 1
    dataframe_values_0 = df.loc[df[name] == 0]  # all the rows with 0
    data_train_zero = dataframe_values_0.iloc[:train_0, :]
    data_train_one = dataframe_values_1.iloc[:train_1, :]
    data_test_zero = dataframe_values_0.iloc[-test_0:, :]
    data_test_one = dataframe_values_1.iloc[-test_1:, :]
    data_train = pd.concat([data_train_zero, data_train_one])
    data_test = pd.concat([data_test_zero, data_test_one])
    ...
    return data_train, data_test
It works, but I don't want to compute the row counts by hand and pass them as parameters; I'd like the split to be done automatically from a percentage.
I'm working on Google Colab.
CodePudding user response:
You can sample a given percentage of the data using the pandas.DataFrame.sample method:
import pandas as pd
p_ones, p_zeros = 0.6, 0.3 # 60% and 30% from your question
df_ones = df[df['target_name'] == 1] # data with labels 1
df_zeros = df[df['target_name'] == 0] # data with labels 0
# 60% of data with labels 1
train_df_ones = df_ones.sample(int(len(df_ones) * p_ones))
# 30% of data with labels 0
train_df_zeros = df_zeros.sample(int(len(df_zeros) * p_zeros))
# Training data with 60% 1s and 30% 0s
train_df = pd.concat([train_df_ones, train_df_zeros], axis=0)
# Test data with 40% 1s and 70% 0s
test_df = df[~df.index.isin(train_df.index)]
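For reference, the same idea can be wrapped into a small reusable helper and checked on a toy DataFrame. The column name `target_name` and the function name `split_by_class` are just placeholders; `random_state` is added so the split is reproducible (a minimal sketch, not part of the original answer):

```python
import pandas as pd

def split_by_class(df, name, p_ones=0.6, p_zeros=0.3, seed=0):
    """Sample p_ones of the rows labelled 1 and p_zeros of the rows
    labelled 0 as the training set; everything else is the test set."""
    df_ones = df[df[name] == 1]    # rows with label 1
    df_zeros = df[df[name] == 0]   # rows with label 0
    train_ones = df_ones.sample(int(len(df_ones) * p_ones), random_state=seed)
    train_zeros = df_zeros.sample(int(len(df_zeros) * p_zeros), random_state=seed)
    train_df = pd.concat([train_ones, train_zeros])
    # Test set = all rows whose index is not in the training set
    test_df = df[~df.index.isin(train_df.index)]
    return train_df, test_df

# Toy example: 10 rows labelled 1 and 10 labelled 0
toy = pd.DataFrame({'target_name': [1] * 10 + [0] * 10})
train, test = split_by_class(toy, 'target_name')
print(len(train), len(test))  # 9 11  (6 ones + 3 zeros in train)
```

Note that `int(...)` truncates, so with small classes the actual fraction can be slightly below the requested percentage; `df.sample(frac=p)` rounds instead of truncating if that matters.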