Home > Software engineering >  filtering the dataframe based on groupby when meeting specific condition
filtering the dataframe based on groupby when meeting specific condition

Time:09-09

Existing dataframe :

ID       Activity     Action No.     
A1       register         1
A1       fill form        2
A2       Payment Done     3
A2       fill form        2
B1       fill form        1
B1       Payment Done     2
B1       Process_drop     3
B1       fill form        1

Expected Dataframe :

ID       Activity     Action No.     
A1       register         1
A1       fill form        2
A2       Payment Done     3
B1       fill form        1
B1       Payment Done     2

when ID perform Payment Activity the other rows after the payment done need to be dropped. i tried groupby on ID & applied lambda function on Activity , but stuck with how to proceed

CodePudding user response:

You can do

df['Payment'] = df["Activity"].apply(lambda x: True if x == "Payment Done" else None)
df["AfterPayment"] = df.groupby('ID').ffill()['Payment']
df = df[df["AfterPayment"].isna() | (df["Activity"] == "Payment Done")]
df = df.drop(["Payment", "AfterPayment"], axis=1).reset_index(drop=True)

Explanation

  1. Add a temporary column stating if the activity is Payment Done
df['Payment'] = df["Activity"].apply(lambda x: True if x == "Payment Done" else None)
df

   ID      Activity  Action No. Payment
0  A1      register           1    None
1  A1     fill form           2    None
2  A2  Payment Done           3    True
3  A2     fill form           2    None
4  B1     fill form           1    None
5  B1  Payment Done           2    True
6  B1  Process_drop           3    None
7  B1     fill form           1    None
  1. Add a temporary column using groupby and ffill to flag row equal or after Payment Done
df["AfterPayment"] = df.groupby('ID').ffill()['Payment']
df

   ID      Activity  Action No. Payment AfterPayment
0  A1      register           1    None          NaN
1  A1     fill form           2    None          NaN
2  A2  Payment Done           3    True         True
3  A2     fill form           2    None         True
4  B1     fill form           1    None          NaN
5  B1  Payment Done           2    True         True
6  B1  Process_drop           3    None         True
7  B1     fill form           1    None         True
  1. Filter for row that is not after Payment Done and drop the temporary columns.
df = df[df["AfterPayment"].isna() | (df["Activity"] == "Payment Done")]
df = df.drop(["Payment", "AfterPayment"], axis=1).reset_index(drop=True)
df

   ID      Activity  Action No.
0  A1      register           1
1  A1     fill form           2
2  A2  Payment Done           3
3  B1     fill form           1
4  B1  Payment Done           2

CodePudding user response:

You can use the following solution:

def my_fun(data):
    data = data.reset_index()
    if 'Payment Done' in data['Activity'].values:
        ind = data['Activity'][data['Activity'] == 'Payment Done'].index[0]
        return data.loc[:ind]
    else:
        return data


df = df.groupby('ID').apply(my_fun).reset_index(drop=True)
df.drop(columns=['index'], inplace=True)


   ID      Activity  Action No.
0  A1      register           1
1  A1     fill form           2
2  A2  Payment Done           3
3  B1     fill form           1
4  B1  Payment Done           2

CodePudding user response:

Given:

   ID      Activity  Action_No
0  A1      register          1
1  A1     fill form          2
2  A2  Payment Done          3
3  A2     fill form          2
4  B1     fill form          1
5  B1  Payment Done          2
6  B1  Process_drop          3
7  B1     fill form          1

Doing:

# For each group, we'll filter values...
    # Keeping them all if 'Payment Done' isn't in the group.
    # Or if they are or come before the first 'Payment Done' in the group.
df = (df.groupby('ID')
        .apply(lambda x: x.loc[x.Activity.ne('Payment Done').all() 
                              | (x.index <= x.Activity.eq('Payment Done').idxmax())])
        .reset_index(level=0, drop=True))
print(df)

Output:

   ID      Activity  Action_No
0  A1      register          1
1  A1     fill form          2
2  A2  Payment Done          3
4  B1     fill form          1
5  B1  Payment Done          2

Slight Optimization for newer versions of python:

df = (df.groupby('ID')
        .apply(lambda x: x.loc[(y := x.Activity.ne('Payment Done')).all() 
                              | (x.index <= (~y).idxmax())])
        .reset_index(level=0, drop=True))
  • Related