My sample df looks like this:
id year success
1 2000 N
1 2001 N
1 2002 Y
1 2003 N
1 2004 N
2 2000 N
2 2001 N
2 2002 N
3 2000 N
3 2001 N
3 2002 Y
....
Here, we can see that id==1 and id==3 have both success==Y and success==N, but id==2 only has success==N.
What I want to do:
- Within each group, once we find the first success==Y, keep that row but drop the remaining rows for that group, e.g. id==1.
This is how the new df should look:
id year success
1 2000 N
1 2001 N
1 2002 Y
2 2000 N
2 2001 N
2 2002 N
3 2000 N
3 2001 N
3 2002 Y
....
Here, in the above df, we removed the extra rows after we encountered success==Y. Since id==2 does not have success==Y, we did not remove any rows there, and for id==3 the last row is the Y, so no rows were removed either.
What I tried:
- I grouped by id, but I still want all the rows per group even though the ids are duplicated, so a plain groupby aggregation did not work.
Could someone please help me achieve this result?
CodePudding user response:
First compare the success column against the value Y, which gives you True wherever success=='Y'. To mark all rows after the first Y in each group as drop candidates, apply cummax to this condition per group and shift it down one row (so the Y row itself stays False and is kept). Finally, use the negated condition to filter the data frame:
df[~df.success.eq('Y').groupby(df.id).apply(lambda g: g.cummax().shift(fill_value=False))]
id year success
0 1 2000 N
1 1 2001 N
2 1 2002 Y
5 2 2000 N
6 2 2001 N
7 2 2002 N
8 3 2000 N
9 3 2001 N
10 3 2002 Y
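For reference, here is a self-contained sketch of the same cummax-and-shift idea, with the sample data reconstructed from the question (the DataFrame literal is an assumption). It uses transform instead of apply, which is an equivalent way to broadcast the per-group mask back to the original index:

```python
import pandas as pd

# Sample data reconstructed from the question (assumed values)
df = pd.DataFrame({
    'id':      [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'year':    [2000, 2001, 2002, 2003, 2004,
                2000, 2001, 2002, 2000, 2001, 2002],
    'success': ['N', 'N', 'Y', 'N', 'N',
                'N', 'N', 'N', 'N', 'N', 'Y'],
})

# cummax turns the boolean "is Y" flag into "a Y has occurred so far";
# shifting by one row keeps the first Y row itself out of the drop mask.
drop_mask = (df['success'].eq('Y')
               .groupby(df['id'])
               .transform(lambda s: s.cummax().shift(fill_value=False)))

result = df[~drop_mask]
print(result)
```

With the sample data this keeps three rows per id, matching the output shown above.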
CodePudding user response:
Try this:
(df.loc[(df.index <= df['id'].map(df['success'].eq('Y').groupby(df['id']).idxmax())) |
df['success'].eq('N').groupby(df['id']).transform('all')])
or
(df.loc[df.groupby('id')['success']
.transform(lambda x: x.eq('Y')
.cumsum()
.shift()
.fillna(0)
.eq(0))])
Output:
id year success
0 1 2000 N
1 1 2001 N
2 1 2002 Y
5 2 2000 N
6 2 2001 N
7 2 2002 N
8 3 2000 N
9 3 2001 N
10 3 2002 Y
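For completeness, the second variant above can be checked end to end against the sample data (the DataFrame literal below is reconstructed from the question and is an assumption). The logic: within each group, count how many Y values have occurred so far, shift that count down one row, and keep a row only while the shifted count is still zero, i.e. no Y has appeared on an earlier row of the same group:

```python
import pandas as pd

# Sample data reconstructed from the question (assumed values)
df = pd.DataFrame({
    'id':      [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'year':    [2000, 2001, 2002, 2003, 2004,
                2000, 2001, 2002, 2000, 2001, 2002],
    'success': ['N', 'N', 'Y', 'N', 'N',
                'N', 'N', 'N', 'N', 'N', 'Y'],
})

# cumsum counts Y occurrences so far; shift + fillna(0) delays the count
# by one row so the first Y row itself still sees a count of 0 and is kept.
keep = (df.groupby('id')['success']
          .transform(lambda x: x.eq('Y')
                                .cumsum()
                                .shift()
                                .fillna(0)
                                .eq(0)))

result = df.loc[keep]
print(result)
```

Both variants in this answer produce the same nine-row result as the first answer.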