I have a dataframe df
:-
ID | Date | Event |
---|---|---|
1 | 30-10-2013 | Success |
1 | 08-11-2013 | Success |
1 | 06-12-2013 | Success |
1 | 24-02-2014 | Click |
1 | 24-02-2014 | Form |
1 | 04-03-2014 | |
1 | 15-04-2014 | Success |
1 | 16-04-2014 | Click |
1 | 17-05-2014 | Success |
1 | 21-06-2014 | |
1 | 01-07-2014 | Webpage |
1 | 03-07-2014 | |
2 | 05-07-2014 | Form |
2 | 06-08-2014 | Webpage |
2 | 07-09-2014 | Success |
I want to remove rows which have Event
Success
if the Event
starts with Success
for each ID
(sorted in chronological order) and also remove the events(rows) after the last Success
Event for each ID
.
Expected :-
ID | Date | Event |
---|---|---|
1 | 24-02-2014 | Click |
1 | 24-02-2014 | Form |
1 | 04-03-2014 | |
1 | 15-04-2014 | Success |
1 | 16-04-2014 | Click |
1 | 17-05-2014 | Success |
2 | 05-07-2014 | Form |
2 | 06-08-2014 | Webpage |
2 | 07-09-2014 | Success |
CodePudding user response:
Provided the dataframe is already sorted, this should work:
df["n"] = df.groupby("ID")["Event"].transform(lambda x: (x == "Success").shift(1, fill_value=0).cumsum())
df["keep"] = df.groupby(["ID", "n"])["Event"].transform(lambda x: (len(x) > 1) & (x.iloc[-1] == "Success"))
result = df.loc[df["keep"]].drop(columns=["keep", "n"])
A bit of explanation:
- "n" numbers a group of rows containing one "Success", using this trick: https://www.codeforests.com/2021/03/30/group-consecutive-rows-in-pandas/
- "keep" creates a filter based on row group containing more than 1 row (not only a single "Success") and the last row being "Success"