I have a data frame called df whose value counts are the following:
df.Priority.value_counts()
P3    39506
P2     3038
P4     1138
P1     1117
P5      252
Name: Priority, dtype: int64
I am trying to create a more balanced dataset called df_balanced from df by restricting the number of entries in the P3 category to 5000 while keeping every row of the other categories. The expected output should look like this:
P3     5000
P2     3038
P4     1138
P1     1117
P5      252
Name: Priority, dtype: int64
I tried the following code:
s0 = df.Priority[df.Priority.eq('P3')].sample(5000).index
df_balanced = df.loc[s0.union(df)].reset_index(drop=True, inplace=True) # I am unsure how to exclude the entries of `P3` categories from `df`!
I used this question as a reference: "Randomly selecting rows from a dataframe based on a column value", but the solution provided there isn't optimal for more than 2 categories.
Answer:
A possible solution, shown on the small sample data below (column X, with category P1 capped at 4 rows):
import random

# Maximum number of P1 rows to keep; when P1 has more rows than this,
# the rows to keep are chosen at random
maxlim_catP1 = 4

df.groupby('X').apply(
    lambda g: g.loc[random.sample(g.index.to_list(), min(maxlim_catP1, len(g))), :]
    if (g.loc[g.index[0], 'X'] == 'P1') else g)
Output:
       X  Y
X
P1 2  P1  c
   3  P1  d
   0  P1  a
   1  P1  b
P2 4  P2  e
   6  P2  g
   7  P2  h
Data:
    X  Y
0  P1  a
1  P1  b
2  P1  c
3  P1  d
4  P2  e
5  P1  f
6  P2  g
7  P2  h
8  P1  i
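
An alternative that avoids groupby, and returns a flat frame shaped like the df_balanced the question asks for, is to split the rows with a boolean mask, sample only the oversized category, and concatenate the two parts. The following is a minimal sketch rather than a definitive implementation: the helper name cap_category and its seed parameter are hypothetical names introduced here for illustration, and the demo reuses the sample data above.

```python
import pandas as pd


def cap_category(df, column, category, limit, seed=0):
    """Return a copy of df in which `category` has at most `limit` rows.

    `cap_category` and `seed` are illustrative names, not part of pandas.
    """
    is_cat = df[column].eq(category)
    capped = df[is_cat]
    # Sample only when the category actually exceeds the limit;
    # DataFrame.sample raises if n is larger than the number of rows.
    if len(capped) > limit:
        capped = capped.sample(n=limit, random_state=seed)
    # Keep every row of the other categories, restore the original
    # row order, then renumber the index from 0.
    return pd.concat([df[~is_cat], capped]).sort_index().reset_index(drop=True)


# Demo on the sample data above: cap P1 at 4 rows.
df = pd.DataFrame({
    "X": ["P1", "P1", "P1", "P1", "P2", "P1", "P2", "P2", "P1"],
    "Y": list("abcdefghi"),
})
df_balanced = cap_category(df, "X", "P1", 4)
print(df_balanced["X"].value_counts())
```

On the sample data this leaves 4 rows of P1 and all 3 rows of P2. For the data frame in the question, the equivalent call would be cap_category(df, "Priority", "P3", 5000). Note that reset_index is called without inplace=True here: with inplace=True it returns None, so assigning its result to df_balanced, as in the question's attempt, would leave df_balanced empty.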