I'm working with a huge dataset and I have to split it into two parts for training and testing. I know there is a dedicated function (sklearn.model_selection.train_test_split), but since the dataset is imbalanced I have to write my own function.
What I want to do is divide the dataset into two parts according to the label values 1 and 0, take a percentage of each (e.g. 60% of the 1s and 30% of the 0s) and save it to the train file. The remaining rows (40% and 70%) should be saved to the test file.
At the moment, I've done it this way:
def split_test_train(df, train_0, train_1, test_0, test_1, name='column_name'):
    dataframe_values_1 = df.loc[df[name] == 1]  # all the rows with 1
    dataframe_values_0 = df.loc[df[name] == 0]  # all the rows with 0
    data_train_zero = dataframe_values_0.iloc[:train_0, :]
    data_train_one = dataframe_values_1.iloc[:train_1, :]
    data_test_zero = dataframe_values_0.iloc[-test_0:, :]
    data_test_one = dataframe_values_1.iloc[-test_1:, :]
    data_train = pd.concat([data_train_zero, data_train_one])
    data_test = pd.concat([data_test_zero, data_test_one])
    ...
    return data_train, data_test
It works, but I don't want to compute the row counts by hand and pass them as parameters; I'd like the split to be done automatically from a percentage.
I'm working on Google Colab.
CodePudding user response:
You can sample a given percentage of the data using the pandas.DataFrame.sample method:
import pandas as pd
p_ones, p_zeros = 0.6, 0.3 # 60% and 30% from your question
df_ones = df[df['target_name'] == 1] # data with labels 1
df_zeros = df[df['target_name'] == 0] # data with labels 0
# 60% of data with labels 1
train_df_ones = df_ones.sample(int(len(df_ones) * p_ones))
# 30% of data with labels 0
train_df_zeros = df_zeros.sample(int(len(df_zeros) * p_zeros))
# Training data with 60% 1s and 30% 0s
train_df = pd.concat([train_df_ones, train_df_zeros], axis=0)
# Test data with 40% 1s and 70% 0s
test_df = df[~df.index.isin(train_df.index)]
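For reference, the same idea can be wrapped into a small reusable helper and checked on a toy DataFrame. The column name `target_name` and the function name `split_by_class` are just placeholders; `random_state` is added so the split is reproducible (a minimal sketch, not part of the original answer):

```python
import pandas as pd

def split_by_class(df, name, p_ones=0.6, p_zeros=0.3, seed=0):
    """Sample p_ones of the rows labelled 1 and p_zeros of the rows
    labelled 0 as the training set; everything else is the test set."""
    df_ones = df[df[name] == 1]    # rows with label 1
    df_zeros = df[df[name] == 0]   # rows with label 0
    train_ones = df_ones.sample(int(len(df_ones) * p_ones), random_state=seed)
    train_zeros = df_zeros.sample(int(len(df_zeros) * p_zeros), random_state=seed)
    train_df = pd.concat([train_ones, train_zeros])
    # Test set = all rows whose index is not in the training set
    test_df = df[~df.index.isin(train_df.index)]
    return train_df, test_df

# Toy example: 10 rows labelled 1 and 10 labelled 0
toy = pd.DataFrame({'target_name': [1] * 10 + [0] * 10})
train, test = split_by_class(toy, 'target_name')
print(len(train), len(test))  # 9 11  (6 ones + 3 zeros in train)
```

Note that `int(...)` truncates, so with small classes the actual fraction can be slightly below the requested percentage; `df.sample(frac=p)` rounds instead of truncating if that matters.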