How do I split a large DataFrame that has many categorical columns, each with multiple labels (classes)?
For example, I have 1 million rows and 100 columns, 50 of which are categorical columns with different labels.
How can I divide the DataFrame into 2 or 3 parts (subsets) so that every label of every categorical column is present in each subset? Is that possible for a large dataset?
import numpy as np

def rec():
    print('#rec Started')
    shuf_data = df.sample(frac=1)                    # shuffle the rows
    ran_data = np.random.rand(len(shuf_data)) < 0.5  # random 50:50 mask
    p_d = shuf_data[ran_data]
    d = shuf_data[~ran_data]

    def rrec(p_d, d):
        print('#rrec Started')
        for col in df_cat_cols:
            p_dcol = p_d[col].unique()
            dcol = d[col].unique()
            outcome = all(elem in p_dcol for elem in dcol)
            if outcome:
                print("Yes, list1 contains all elements in list2")
            else:
                print("No, list1 does not contain all elements in list2")
                return rec()  # a label is missing from one half: reshuffle and retry
        return p_d, d

    return rrec(p_d, d)
With 1 million records, the above code kills the process: every failed check reshuffles the whole DataFrame and retries, so it may recurse indefinitely. Please suggest a better, more efficient approach. Thank you.
Here is an example:
Fruits Color Price
0 Banana Yellow 60
1 Grape Black 100
2 Apple Red 200
3 Papaya Yellow 50
4 Dragon Pink 150
5 Mango Yellow 400
6 Banana Yellow 75
7 Grape Black 106
8 Apple Red 190
9 Papaya Yellow 60
10 Dragon Pink 120
11 Mango Yellow 390
Expected 50:50 split:
df1:
3 Papaya Yellow 50
4 Dragon Pink 150
5 Mango Yellow 400
6 Banana Yellow 75
7 Grape Black 106
8 Apple Red 190
df2:
0 Banana Yellow 60
1 Grape Black 100
2 Apple Red 200
9 Papaya Yellow 60
10 Dragon Pink 120
11 Mango Yellow 390
CodePudding user response:
Why don't you try the train_test_split() method from scikit-learn, together with OneHotEncoder() to encode the categorical columns? This is more of a machine-learning approach, but I have used it to split a dataset with 1 million rows before, so it should work.
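As a minimal sketch of that idea (stratifying on a single categorical column, which is simpler than one-hot encoding all 50 — to stratify on several columns you could combine them into one key first), using the fruit data from the question:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The 12-row fruit frame from the question
df = pd.DataFrame({
    'Fruits': ['Banana', 'Grape', 'Apple', 'Papaya', 'Dragon', 'Mango'] * 2,
    'Color':  ['Yellow', 'Black', 'Red', 'Yellow', 'Pink', 'Yellow'] * 2,
    'Price':  [60, 100, 200, 50, 150, 400, 75, 106, 190, 60, 120, 390],
})

# stratify keeps the label proportions of 'Fruits' equal in both halves,
# so every fruit appears in df1 and in df2
df1, df2 = train_test_split(df, test_size=0.5,
                            stratify=df['Fruits'], random_state=0)
```

Stratification requires at least two rows per label, which holds here (and comfortably so at 1 million rows).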
CodePudding user response:
Yes, one way is to enumerate the rows within each category combination:
cat_cols = ['cat_col1', 'cat_col2']            # your categorical columns
# number rows 0, 1, ... within each combination, then bucket them 3 at a time
groups = df.groupby(cat_cols).cumcount() // 3
# one sub-DataFrame per bucket
sub_df = {g: d for g, d in df.groupby(groups)}
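Applied to the fruit example from the question — using `%` (round-robin) instead of `//` (chunks of 3) so that each of the two halves gets one row per combination — a sketch:

```python
import pandas as pd

# The 12-row fruit frame from the question
df = pd.DataFrame({
    'Fruits': ['Banana', 'Grape', 'Apple', 'Papaya', 'Dragon', 'Mango'] * 2,
    'Color':  ['Yellow', 'Black', 'Red', 'Yellow', 'Pink', 'Yellow'] * 2,
    'Price':  [60, 100, 200, 50, 150, 400, 75, 106, 190, 60, 120, 390],
})

cat_cols = ['Fruits', 'Color']
# cumcount numbers rows 0, 1, ... within each (Fruits, Color) combination;
# % 2 deals them round-robin into two buckets, so every combination that has
# at least 2 rows lands in both subsets
groups = df.groupby(cat_cols).cumcount() % 2
sub_df = {g: d for g, d in df.groupby(groups)}
```

Unlike the shuffle-and-retry approach, this is a single pass with no rejection loop, so it scales to 1 million rows.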