Splitting up a dataset in python


I have a dataset with about 500,000 records and they are grouped. I would like to shuffle it and split it into 10 smaller datasets based on the percentage weightings of each group. I want each dataset to contain all groups. Is there a library or method to do this in Python?

  1. I tried numpy's array_split, which just splits the dataset without stratification
  2. Stratification in scikit-learn does not really help, since it uses training and test splits

CodePudding user response:

You can use the sklearn.model_selection.StratifiedShuffleSplit class to accomplish this. It creates stratified random splits of a dataset in which the proportion of samples from each class is approximately the same in every split. Set the n_splits parameter to 10 to generate 10 splits, and the test_size parameter to the fraction of the data you want in each split (0.1 gives ten roughly equal-sized pieces). Here's an example of how you can use this class:

from sklearn.model_selection import StratifiedShuffleSplit

# Create the splits
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=42)

# Iterate through the splits
for train_index, test_index in splitter.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Do something with the splits

You will first have to convert your dataset into a format that sklearn functions accept. They require X and y as input, where X is the feature set and y is the target variable.
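For instance, if your records live in a pandas DataFrame (the names data and groups below are placeholders for your own frame and group column), you can stratify on the group column itself and keep each test portion as one of the 10 datasets. Keep in mind that StratifiedShuffleSplit draws each split independently, so the 10 portions can overlap rather than partition the data:

from sklearn.model_selection import StratifiedShuffleSplit

# placeholder names: 'data' is your DataFrame, 'groups' is the column with the group labels
X = data.drop(columns=['groups'])   # everything except the group column
y = data['groups']                  # stratify on the group label itself

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=42)

# keep each stratified 10% test portion as one of the 10 smaller datasets
datasets = [data.iloc[test_index] for _, test_index in splitter.split(X, y)]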

CodePudding user response:

You can use k-fold splitting to achieve what you're looking for. Something like

from sklearn.model_selection import StratifiedKFold

k = 10  # number of folds = number of datasets
folds = list(StratifiedKFold(n_splits=k, shuffle=True, random_state=1).split(X_train, y_train))

See the documentation here https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
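As a sketch, assuming the records are in a pandas DataFrame with the group labels in a column (the names data and groups below are placeholders), each of the k test folds can serve directly as one of the 10 smaller datasets, since the folds are disjoint, together cover every record once, and preserve the group proportions:

from sklearn.model_selection import StratifiedKFold

# placeholder names: 'data' is your DataFrame, 'groups' is the column with the group labels
k = 10
skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=1)

# the k test folds are disjoint, cover the whole dataset, and keep each group's proportion
datasets = [data.iloc[test_index] for _, test_index in skf.split(data, data['groups'])]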

CodePudding user response:

One way to achieve this is by using the pandas library to group the data by the group column, shuffle the data within each group, and then split the data into smaller datasets based on the percentage weightings. Here's an example of how you can do this:

import pandas as pd

# assuming 'data' is your dataset and 'groups' is the column in the dataframe that contains the group information

# Group the data by the group column
grouped_data = data.groupby('groups')

# Shuffle the rows within each group (the result is indexed by group first, then by the original index)
shuffled_data = grouped_data.apply(lambda x: x.sample(frac=1))

# Get the total number of records for each group
group_counts = grouped_data.size()

# Create a dictionary to store the 10 datasets
datasets = {}

# Iterate 10 times to create 10 non-overlapping datasets
for i in range(10):
    parts = []
    for group, count in group_counts.items():
        # Each group's weight is count / len(data), so each dataset gets count / 10 of its rows
        group_count_in_dataset = count // 10
        # Take the i-th slice of this group's shuffled rows so the 10 datasets do not overlap
        start = i * group_count_in_dataset
        parts.append(shuffled_data.loc[group].iloc[start:start + group_count_in_dataset])
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    datasets[f'dataset_{i}'] = pd.concat(parts)

This will ensure that each dataset contains all groups, with approximately the same percentage weightings as the original dataset (approximately, because each group's count is divided by 10 and rounded down).
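As a quick sanity check, using the same names as above, you can compare the group proportions of each generated dataset against the original data:

# group proportions in the original data
print(data['groups'].value_counts(normalize=True))

# group proportions in each of the 10 generated datasets
for name, subset in datasets.items():
    print(name, subset['groups'].value_counts(normalize=True).round(3).to_dict())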
