Select rows based on multiple conditions, keeping or changing the group number (pandas DataFrame)

Time:01-25

I have a pandas DataFrame (df)

import pandas as pd
import numpy as np

df = pd.DataFrame({'user': ['user 1', 'user 2', 'user 3', 'user 4', 'user 1', 'user 2', 'user 3', 'user 4'],
                   'group': [0, 0, 0, 0, 1, 1, 1, 1],
                   'p1': [0.759, 1.106, 1.619, 1.260, 0.540, 1.437, 1.440, 1.332],
                   'p2': [0.9, 0.9, 0.9, 0.9, 0.7, 0.7, 0.7, 0.7]})
df

output:

     user  group     p1   p2
0  user 1      0  0.759  0.9
1  user 2      0  1.106  0.9
2  user 3      0  1.619  0.9
3  user 4      0  1.260  0.9
4  user 1      1  0.540  0.7
5  user 2      1  1.437  0.7
6  user 3      1  1.440  0.7
7  user 4      1  1.332  0.7

For each user, I want to return the row where p1 is below p2. If no row for that user meets this condition, I still want to return the user, but with the group number changed to a new value (a random number that is not in the list of existing group numbers).

For example, for user 1, row 4 should be selected: rows 0 and 4 both meet the condition, but row 4 has the lower p1 (with group number 1). For users 2, 3, and 4, p1 is higher than p2 in every row, so the group number should be changed to a new value.

I used the following code, but it always changes the group number to the maximum existing group number plus one (here 2).

mylist=df['group'].values.tolist()
lst = list(set(mylist))

df2 = (df[df['p1'].lt(df['p2'])]
           .set_index('group')
           .groupby('user')['p1']
           .idxmin()
           .reindex(df['user'].unique(), fill_value=max(set(lst)) + 1)
           .reset_index(name='group'))
df2

output:

    user   group
0   user 1  1
1   user 2  2
2   user 3  2
3   user 4  2

Expected output: when the condition is not met (p1 is not below p2), replace the group number with a random number that is not in the list of group numbers (here, the group list is [0, 1]).


CodePudding user response:

You can manipulate the group numbers before groupby("user"):

  • If a row has p1 < p2, keep the same group
  • Otherwise, change it to an arbitrary number that does not appear in the original data and is also unique within the output.

For the second part, since we don't care what the new group number is, we can simply take df["group"] + df["group"].max() + [1, 2, 3, ...]:

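# Offset is 0 where p1 < p2; otherwise max group + 1, 2, 3, ... (one value per row)
s = np.where(df["p1"] < df["p2"], 0, df["group"].max() + np.arange(len(df)) + 1)
result = (
    df.assign(new_group=df["group"] + s)
    .sort_values(["user", "p1"])  # lowest p1 first within each user
    .groupby("user")
    .head(1)                      # keep that first row per user
)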

Result:

     user  group     p1   p2  new_group
4  user 1      1  0.540  0.7          1
1  user 2      0  1.106  0.9          3
6  user 3      1  1.440  0.7          9
3  user 4      0  1.260  0.9          5

Trim the columns as needed.
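
As a minimal sketch of that trimming step, assuming the desired final shape is just user and group as in the question's output, the chain above can be extended as follows. The name trimmed is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user': ['user 1', 'user 2', 'user 3', 'user 4',
                            'user 1', 'user 2', 'user 3', 'user 4'],
                   'group': [0, 0, 0, 0, 1, 1, 1, 1],
                   'p1': [0.759, 1.106, 1.619, 1.260, 0.540, 1.437, 1.440, 1.332],
                   'p2': [0.9, 0.9, 0.9, 0.9, 0.7, 0.7, 0.7, 0.7]})

# Offset is 0 where p1 < p2; otherwise max group + 1, 2, 3, ... (one value per row)
s = np.where(df["p1"] < df["p2"], 0, df["group"].max() + np.arange(len(df)) + 1)

trimmed = (
    df.assign(new_group=df["group"] + s)
    .sort_values(["user", "p1"])
    .groupby("user")
    .head(1)
    .loc[:, ["user", "new_group"]]           # drop the helper and input columns
    .rename(columns={"new_group": "group"})  # match the column name in the question
    .reset_index(drop=True)
)
print(trimmed)
```

This leaves one row per user with only the user and group columns.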

CodePudding user response:

You can use:

# Create one replacement group per user, starting just above the current max group
new_group = np.arange(df['user'].nunique()) + df['group'].max() + 1

# Keep one row per user (the one most likely to satisfy the condition, i.e. lowest p1 - p2)
out = (df.assign(dp=lambda x: x['p1'] - x['p2'])
         .sort_values(['dp', 'p1'])
         .drop_duplicates('user'))

# Set new group if needed else keep original group
out['group'] = np.where(out['dp'] < 0, out['group'], new_group)

Output:

>>> out
     user  group     p1   p2     dp
4  user 1      1  0.540  0.7 -0.160
1  user 2      3  1.106  0.9  0.206
3  user 4      4  1.260  0.9  0.360
2  user 3      5  1.619  0.9  0.719
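
If the replacement numbers should really be random rather than consecutive, as the question asks, one possible variation is sketched below. The seed and the candidate range 0 to 99 are assumptions made purely for illustration, as are the names rng, pool and random_groups. Giving each user a distinct number is also an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user': ['user 1', 'user 2', 'user 3', 'user 4',
                            'user 1', 'user 2', 'user 3', 'user 4'],
                   'group': [0, 0, 0, 0, 1, 1, 1, 1],
                   'p1': [0.759, 1.106, 1.619, 1.260, 0.540, 1.437, 1.440, 1.332],
                   'p2': [0.9, 0.9, 0.9, 0.9, 0.7, 0.7, 0.7, 0.7]})

rng = np.random.default_rng(0)  # arbitrary seed, only to make the draw repeatable

# Keep one row per user: the one with the lowest p1 - p2
out = (df.assign(dp=lambda x: x['p1'] - x['p2'])
         .sort_values(['dp', 'p1'])
         .drop_duplicates('user'))

# Candidate pool: integers in an assumed range [0, 100) not already used as a group
used = set(df['group'])
pool = np.array([g for g in range(100) if g not in used])

# Draw one distinct candidate per remaining row, without replacement
random_groups = rng.choice(pool, size=len(out), replace=False)

# Keep the original group where p1 < p2, otherwise use the random one
out['group'] = np.where(out['dp'] < 0, out['group'], random_groups)
print(out[['user', 'group']])
```

Because the pool excludes every existing group number and the draw is without replacement, the new numbers cannot collide with existing groups or with each other.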