How do I split a large DataFrame that has many categorical columns, each with multiple labels (classes)?
For example, I have 1 million rows and 100 columns, 50 of which are categorical columns with different labels.
How can I divide the DataFrame into 2 or 3 parts (subsets) so that every label of every categorical column is present in each subset? Is that possible for a large dataset?
import numpy as np

def rec():
    print('#rec Started')
    shuf_data = df.sample(frac=1)                    # shuffle the rows
    ran_data = np.random.rand(len(shuf_data)) < 0.5  # random 50:50 mask
    p_d = shuf_data[ran_data]
    d = shuf_data[~ran_data]

    def rrec(p_d, d):
        print('#rrec Started')
        for col in df_cat_cols:
            p_dcol = p_d[col].unique()
            dcol = d[col].unique()
            outcome = all(elem in p_dcol for elem in dcol)
            if outcome:
                print("Yes, list1 contains all elements in list2")
            else:
                print("No, list1 does not contain all elements in list2")
                return rec()  # a label is missing from one half: reshuffle and retry
        return p_d, d

    return rrec(p_d, d)
With 1 million records, the above code kills the process: every failed check reshuffles the whole DataFrame and retries, so it may recurse indefinitely. Please suggest a better, more efficient approach. Thank you.
Here is an example:
Fruits Color Price
0 Banana Yellow 60
1 Grape Black 100
2 Apple Red 200
3 Papaya Yellow 50
4 Dragon Pink 150
5 Mango Yellow 400
6 Banana Yellow 75
7 Grape Black 106
8 Apple Red 190
9 Papaya Yellow 60
10 Dragon Pink 120
11 Mango Yellow 390
Expected 50:50 split:
df1:
3 Papaya Yellow 50
4 Dragon Pink 150
5 Mango Yellow 400
6 Banana Yellow 75
7 Grape Black 106
8 Apple Red 190
df2:
0 Banana Yellow 60
1 Grape Black 100
2 Apple Red 200
9 Papaya Yellow 60
10 Dragon Pink 120
11 Mango Yellow 390
CodePudding user response:
Why don't you try the train_test_split() method from scikit-learn, together with OneHotEncoder() to encode the categorical columns? This is more of a machine-learning approach, but I have used it to split a dataset with 1 million rows before, so it should work.
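As a minimal sketch of that idea (stratifying on a single categorical column, which is simpler than one-hot encoding all 50 — to stratify on several columns you could combine them into one key first), using the fruit data from the question:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The 12-row fruit frame from the question
df = pd.DataFrame({
    'Fruits': ['Banana', 'Grape', 'Apple', 'Papaya', 'Dragon', 'Mango'] * 2,
    'Color':  ['Yellow', 'Black', 'Red', 'Yellow', 'Pink', 'Yellow'] * 2,
    'Price':  [60, 100, 200, 50, 150, 400, 75, 106, 190, 60, 120, 390],
})

# stratify keeps the label proportions of 'Fruits' equal in both halves,
# so every fruit appears in df1 and in df2
df1, df2 = train_test_split(df, test_size=0.5,
                            stratify=df['Fruits'], random_state=0)
```

Stratification requires at least two rows per label, which holds here (and comfortably so at 1 million rows).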
CodePudding user response:
Yes, one way is to enumerate the rows within each category combination:
cat_cols = ['cat_col1', 'cat_col2']            # your categorical columns
# number rows 0, 1, ... within each combination, then bucket them 3 at a time
groups = df.groupby(cat_cols).cumcount() // 3
# one sub-DataFrame per bucket
sub_df = {g: d for g, d in df.groupby(groups)}
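Applied to the fruit example from the question — using `%` (round-robin) instead of `//` (chunks of 3) so that each of the two halves gets one row per combination — a sketch:

```python
import pandas as pd

# The 12-row fruit frame from the question
df = pd.DataFrame({
    'Fruits': ['Banana', 'Grape', 'Apple', 'Papaya', 'Dragon', 'Mango'] * 2,
    'Color':  ['Yellow', 'Black', 'Red', 'Yellow', 'Pink', 'Yellow'] * 2,
    'Price':  [60, 100, 200, 50, 150, 400, 75, 106, 190, 60, 120, 390],
})

cat_cols = ['Fruits', 'Color']
# cumcount numbers rows 0, 1, ... within each (Fruits, Color) combination;
# % 2 deals them round-robin into two buckets, so every combination that has
# at least 2 rows lands in both subsets
groups = df.groupby(cat_cols).cumcount() % 2
sub_df = {g: d for g, d in df.groupby(groups)}
```

Unlike the shuffle-and-retry approach, this is a single pass with no rejection loop, so it scales to 1 million rows.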