Randomly select rows from a DataFrame based on a column value

I have a DataFrame called df whose Priority column has the following value counts:

df.Priority.value_counts()

P3    39506
P2    3038 
P4    1138 
P1    1117 
P5    252  
Name: Priority, dtype: int64

I am trying to create a more balanced dataset called df_balanced from df by restricting the number of entries in the P3 category to 5000. The expected value counts are:

P3    5000
P2    3038 
P4    1138 
P1    1117 
P5    252  
Name: Priority, dtype: int64

I tried the following code:

s0 = df.Priority[df.Priority.eq('P3')].sample(5000).index

df_balanced = df.loc[s0.union(df)].reset_index(drop=True, inplace=True)  # I am unsure how to exclude the entries of `P3` categories from `df`!

I used the question "Randomly selecting rows from a dataframe based on a column value" as a reference, but the solution provided there isn't optimal for more than 2 categories.

CodePudding user response:

A possible solution:

import random

# Maximum number of P1 rows to keep; the rows that are kept
# are chosen at random, and all other categories are left untouched
maxlim_catP1 = 4

df.groupby('X').apply(
    lambda g: g.loc[random.sample(g.index.to_list(), min(maxlim_catP1, len(g))), :] if
    (g.loc[g.index[0], 'X'] == 'P1') else g)

Output:

       X  Y
X          
P1 2  P1  c
   3  P1  d
   0  P1  a
   1  P1  b
P2 4  P2  e
   6  P2  g
   7  P2  h

Data:

    X  Y
0  P1  a
1  P1  b
2  P1  c
3  P1  d
4  P2  e
5  P1  f
6  P2  g
7  P2  h
8  P1  i
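
For comparison, here is a minimal sketch of the same idea without groupby.apply, using the sample data above. The helper name cap_category and its parameters are hypothetical, not part of pandas; the sketch relies only on DataFrame.sample, pd.concat and sort_index. It also avoids a problem in the attempt from the question: reset_index(drop=True, inplace=True) returns None, so df_balanced would end up as None there.

```python
import pandas as pd


def cap_category(df, column, category, limit, seed=None):
    """Keep at most `limit` randomly chosen rows where df[column] == category.

    Hypothetical helper for illustration; rows of all other categories
    are returned unchanged.
    """
    is_target = df[column] == category
    target = df[is_target]
    # Sample only when the category actually exceeds the limit
    if len(target) > limit:
        target = target.sample(n=limit, random_state=seed)
    # Recombine with the untouched rows and restore the original row order
    return pd.concat([df[~is_target], target]).sort_index()


# Sample data from the answer above
df = pd.DataFrame({
    "X": ["P1", "P1", "P1", "P1", "P2", "P1", "P2", "P2", "P1"],
    "Y": list("abcdefghi"),
})

df_balanced = cap_category(df, "X", "P1", limit=4, seed=0)
print(df_balanced["X"].value_counts())
```

Assuming the column names from the question, the equivalent call would be cap_category(df, "Priority", "P3", limit=5000).reset_index(drop=True), with the result assigned to df_balanced rather than using inplace=True.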