Home > Blockchain >  Order each group by one column and get top elements depending on each group
Order each group by one column and get top elements depending on each group

Time:11-18

I have a pandas dataframe with the following columns:

month userID probability top_users_that_month
9 A 5.231 3
9 B 0.963 3
9 E 7.445 3
9 D 9.343 3
9 C 3.231 3
10 X 0.963 1
10 B 7.445 1
10 A 2.343 1
13 Y 1.963 2
13 B 9.445 2
13 D 0.343 2

I would like to create a flag column, which has values 1 or 0, depending on if the userID was among the top_users (according to their probability column) for each month. Note that the number of top users in each month is indicated by the top_users_that_month column.

The desired output would be the following:

month userID probability top_users_that_month flag_among_top_users
9 A 5.231 3 1
9 B 0.963 3 0
9 E 7.445 3 1
9 D 9.343 3 1
9 C 3.231 3 0
10 X 0.963 1 0
10 B 7.445 1 1
10 A 2.343 1 0
11 Y 1.963 2 1
11 B 9.445 2 1
11 D 0.343 2 0

I have tried to create the column using groupby and apply but didn't get the result I wanted.

CodePudding user response:

Since the number of users to be marked differ for each group, the easiest approach would be to use a custom function using groupby and apply.

For example:

def get_top_users(df):
    top_users = df['top_users_that_month'].iloc[0]
    idx = df.sort_values('probability', ascending=False).head(top_users).index
    df['flag_among_top_users'] = 0
    df.loc[idx, 'flag_among_top_users'] = 1
    return df
    
df.groupby('month').apply(get_top_users)

Result:

    month userID  probability  top_users_that_month  flag_among_top_users
0       9      A        5.231                     3                     1
1       9      B        0.963                     3                     0
2       9      E        7.445                     3                     1
3       9      D        9.343                     3                     1
4       9      C        3.231                     3                     0
5      10      X        0.963                     1                     0
6      10      B        7.445                     1                     1
7      10      A        2.343                     1                     0
8      13      Y        1.963                     2                     1
9      13      B        9.445                     2                     1
10     13      D        0.343                     2                     0

It is also possible to use nlargest instead of sorting inside get_top_users, but there could be issues if there are two identical probability values.

CodePudding user response:

treshold = 5
df['flag_among_top_users'] = (df['probability'] > treshold).astype(int)

Output:

month userID probability top_users_that_month flag_among_top_users
0 9 A 5.231 3 1
1 9 B 0.963 3 0
2 9 E 7.445 3 1
3 9 D 9.343 3 1
4 9 C 3.231 3 0
5 10 X 0.963 1 0
6 10 B 7.445 1 1
7 10 A 2.343 1 0
8 13 Y 1.963 2 0
9 13 B 9.445 2 1
10 13 D 0.343 2 0
  • Related