Existing dataframe :
ID Activity Action No.
A1 register 1
A1 fill form 2
A2 Payment Done 3
A2 fill form 2
B1 fill form 1
B1 Payment Done 2
B1 Process_drop 3
B1 fill form 1
Expected Dataframe :
ID Activity Action No.
A1 register 1
A1 fill form 2
A2 Payment Done 3
B1 fill form 1
B1 Payment Done 2
when ID perform Payment Activity the other rows after the payment done need to be dropped. i tried groupby on ID & applied lambda function on Activity , but stuck with how to proceed
CodePudding user response:
You can do
df['Payment'] = df["Activity"].apply(lambda x: True if x == "Payment Done" else None)
df["AfterPayment"] = df.groupby('ID').ffill()['Payment']
df = df[df["AfterPayment"].isna() | (df["Activity"] == "Payment Done")]
df = df.drop(["Payment", "AfterPayment"], axis=1).reset_index(drop=True)
Explanation
- Add a temporary column stating if the activity is
Payment Done
df['Payment'] = df["Activity"].apply(lambda x: True if x == "Payment Done" else None)
df
ID Activity Action No. Payment
0 A1 register 1 None
1 A1 fill form 2 None
2 A2 Payment Done 3 True
3 A2 fill form 2 None
4 B1 fill form 1 None
5 B1 Payment Done 2 True
6 B1 Process_drop 3 None
7 B1 fill form 1 None
- Add a temporary column using
groupby
andffill
to flag row equal or afterPayment Done
df["AfterPayment"] = df.groupby('ID').ffill()['Payment']
df
ID Activity Action No. Payment AfterPayment
0 A1 register 1 None NaN
1 A1 fill form 2 None NaN
2 A2 Payment Done 3 True True
3 A2 fill form 2 None True
4 B1 fill form 1 None NaN
5 B1 Payment Done 2 True True
6 B1 Process_drop 3 None True
7 B1 fill form 1 None True
- Filter for row that is not after
Payment Done
and drop the temporary columns.
df = df[df["AfterPayment"].isna() | (df["Activity"] == "Payment Done")]
df = df.drop(["Payment", "AfterPayment"], axis=1).reset_index(drop=True)
df
ID Activity Action No.
0 A1 register 1
1 A1 fill form 2
2 A2 Payment Done 3
3 B1 fill form 1
4 B1 Payment Done 2
CodePudding user response:
You can use the following solution:
def my_fun(data):
data = data.reset_index()
if 'Payment Done' in data['Activity'].values:
ind = data['Activity'][data['Activity'] == 'Payment Done'].index[0]
return data.loc[:ind]
else:
return data
df = df.groupby('ID').apply(my_fun).reset_index(drop=True)
df.drop(columns=['index'], inplace=True)
ID Activity Action No.
0 A1 register 1
1 A1 fill form 2
2 A2 Payment Done 3
3 B1 fill form 1
4 B1 Payment Done 2
CodePudding user response:
Given:
ID Activity Action_No
0 A1 register 1
1 A1 fill form 2
2 A2 Payment Done 3
3 A2 fill form 2
4 B1 fill form 1
5 B1 Payment Done 2
6 B1 Process_drop 3
7 B1 fill form 1
Doing:
# For each group, we'll filter values...
# Keeping them all if 'Payment Done' isn't in the group.
# Or if they are or come before the first 'Payment Done' in the group.
df = (df.groupby('ID')
.apply(lambda x: x.loc[x.Activity.ne('Payment Done').all()
| (x.index <= x.Activity.eq('Payment Done').idxmax())])
.reset_index(level=0, drop=True))
print(df)
Output:
ID Activity Action_No
0 A1 register 1
1 A1 fill form 2
2 A2 Payment Done 3
4 B1 fill form 1
5 B1 Payment Done 2
Slight Optimization for newer versions of python:
df = (df.groupby('ID')
.apply(lambda x: x.loc[(y := x.Activity.ne('Payment Done')).all()
| (x.index <= (~y).idxmax())])
.reset_index(level=0, drop=True))