I have a pandas dataframe with the following columns:
month | userID | probability | top_users_that_month |
---|---|---|---|
9 | A | 5.231 | 3 |
9 | B | 0.963 | 3 |
9 | E | 7.445 | 3 |
9 | D | 9.343 | 3 |
9 | C | 3.231 | 3 |
10 | X | 0.963 | 1 |
10 | B | 7.445 | 1 |
10 | A | 2.343 | 1 |
13 | Y | 1.963 | 2 |
13 | B | 9.445 | 2 |
13 | D | 0.343 | 2 |
I would like to create a flag column, which has values 1 or 0, depending on if the userID was among the top_users (according to their probability
column) for each month. Note that the number of top users in each month is indicated by the top_users_that_month
column.
The desired output would be the following:
month | userID | probability | top_users_that_month | flag_among_top_users |
---|---|---|---|---|
9 | A | 5.231 | 3 | 1 |
9 | B | 0.963 | 3 | 0 |
9 | E | 7.445 | 3 | 1 |
9 | D | 9.343 | 3 | 1 |
9 | C | 3.231 | 3 | 0 |
10 | X | 0.963 | 1 | 0 |
10 | B | 7.445 | 1 | 1 |
10 | A | 2.343 | 1 | 0 |
11 | Y | 1.963 | 2 | 1 |
11 | B | 9.445 | 2 | 1 |
11 | D | 0.343 | 2 | 0 |
I have tried to create the column using groupby
and apply
but didn't get the result I wanted.
CodePudding user response:
Since the number of users to be marked differ for each group, the easiest approach would be to use a custom function using groupby
and apply
.
For example:
def get_top_users(df):
top_users = df['top_users_that_month'].iloc[0]
idx = df.sort_values('probability', ascending=False).head(top_users).index
df['flag_among_top_users'] = 0
df.loc[idx, 'flag_among_top_users'] = 1
return df
df.groupby('month').apply(get_top_users)
Result:
month userID probability top_users_that_month flag_among_top_users
0 9 A 5.231 3 1
1 9 B 0.963 3 0
2 9 E 7.445 3 1
3 9 D 9.343 3 1
4 9 C 3.231 3 0
5 10 X 0.963 1 0
6 10 B 7.445 1 1
7 10 A 2.343 1 0
8 13 Y 1.963 2 1
9 13 B 9.445 2 1
10 13 D 0.343 2 0
It is also possible to use nlargest
instead of sorting inside get_top_users
, but there could be issues if there are two identical probability
values.
CodePudding user response:
treshold = 5
df['flag_among_top_users'] = (df['probability'] > treshold).astype(int)
Output:
month | userID | probability | top_users_that_month | flag_among_top_users | |
---|---|---|---|---|---|
0 | 9 | A | 5.231 | 3 | 1 |
1 | 9 | B | 0.963 | 3 | 0 |
2 | 9 | E | 7.445 | 3 | 1 |
3 | 9 | D | 9.343 | 3 | 1 |
4 | 9 | C | 3.231 | 3 | 0 |
5 | 10 | X | 0.963 | 1 | 0 |
6 | 10 | B | 7.445 | 1 | 1 |
7 | 10 | A | 2.343 | 1 | 0 |
8 | 13 | Y | 1.963 | 2 | 0 |
9 | 13 | B | 9.445 | 2 | 1 |
10 | 13 | D | 0.343 | 2 | 0 |