I have the following dataframe:
. | id | outcome |
---|---|---|
0 | 3 | no |
1 | 3 | no |
2 | 3 | no |
3 | 3 | yes |
4 | 3 | no |
5 | 5 | no |
6 | 5 | no |
7 | 5 | yes |
8 | 5 | yes |
9 | 6 | no |
10 | 6 | no |
11 | 6 | yes |
12 | 6 | yes |
13 | 6 | yes |
14 | 6 | yes |
15 | 6 | yes |
16 | 6 | no |
17 | 6 | no |
I would like to delete all 'yes' rows that form the final run of 'yes' values at the end of each group in the outcome column.
I would also like to drop all 'no' rows that come first in each group.
Both rules must be applied per group of the 'id' column.
This should be the output:
. | id | outcome |
---|---|---|
3 | 3 | yes |
4 | 3 | no |
11 | 6 | yes |
12 | 6 | yes |
13 | 6 | yes |
14 | 6 | yes |
15 | 6 | yes |
16 | 6 | no |
17 | 6 | no |
At the moment I have tried this:
import pandas as pd
df = pd.DataFrame(data={
'id': [3, 3, 3, 3, 3, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6],
'outcome': ['no', 'no', 'no', 'yes', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no', 'no']
})
m1 = df.groupby(['id'])['outcome'].head() != 'yes'
df = df.drop(m1[m1].index)
m2 = df.groupby(['id'])['outcome'].tail() != 'no'
df = df.drop(m2[m2].index)
print(df)
If I put a 1 in head() and tail(), this just removes the last value and not the preceding values. Any suggestions?
CodePudding user response:
You need to compute masks and slice. In summary, I compute the rank of each stretch of consecutive yes/no values within a group to determine whether it is the initial stretch (rank 1) or the final one (max rank per group).
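o = df['outcome']
g = df.groupby('id')['outcome']
# rank of each stretch of identical consecutive values within an id (1, 2, 3, ...)
m1 = o.ne(g.shift()).groupby(df['id']).cumsum()
# rank of the last stretch of each id, repeated on every row of that id
m2 = m1.groupby(df['id']).transform('max')
# drop 'no' rows of the first stretch and 'yes' rows of the last stretch
df[~((m1.eq(1)&o.eq('no'))|(m1.eq(m2)&o.eq('yes')))]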
Output:
id outcome
3 3 yes
4 3 no
11 6 yes
12 6 yes
13 6 yes
14 6 yes
15 6 yes
16 6 no
17 6 no
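To see what the two intermediate series contain, here is a small sketch on only the first two ids of the example. The names `starts`, `run_rank` and `last_rank` are mine for illustration (they correspond to the comparison, `m1` and `m2` above):

```python
import pandas as pd

# Only ids 3 and 5 from the question, to keep the printout short
df = pd.DataFrame({
    'id': [3, 3, 3, 3, 3, 5, 5, 5, 5],
    'outcome': ['no', 'no', 'no', 'yes', 'no', 'no', 'no', 'yes', 'yes'],
})

o = df['outcome']
# True on the first row of every stretch: the value differs from the
# previous row of the same id (the first row of an id compares against NaN)
starts = o.ne(df.groupby('id')['outcome'].shift())
# Cumulative count of stretch starts per id = rank of the stretch
run_rank = starts.groupby(df['id']).cumsum()
# Rank of the last stretch of each id, broadcast to all rows of that id
last_rank = run_rank.groupby(df['id']).transform('max')

print(df.assign(run_rank=run_rank, last_rank=last_rank))
```

For id 5 the 'yes' rows sit in the stretch whose rank equals the maximum, so they are removed together with the leading 'no' rows, which is why id 5 disappears from the output entirely.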
NB. The final mask used in slicing could be simplified using boolean arithmetic, but I left it as is to keep the conditions clear.
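As one possible rewrite (a sketch, not necessarily the shortest form), De Morgan's laws turn the negated "drop" mask into a positive "keep" mask. The names `drop`, `keep` and `result` are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [3, 3, 3, 3, 3, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6],
    'outcome': ['no', 'no', 'no', 'yes', 'no', 'no', 'no', 'yes', 'yes',
                'no', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no', 'no'],
})

o = df['outcome']
m1 = o.ne(df.groupby('id')['outcome'].shift()).groupby(df['id']).cumsum()
m2 = m1.groupby(df['id']).transform('max')

# Form used above: rows to drop (leading 'no' or trailing 'yes')
drop = (m1.eq(1) & o.eq('no')) | (m1.eq(m2) & o.eq('yes'))
# De Morgan: not (A and B) or (C and D)  ==  (not A or not B) and (not C or not D)
keep = (m1.ne(1) | o.ne('no')) & (m1.ne(m2) | o.ne('yes'))

result = df[keep]
print(result)
```

Both masks select the same rows; which one reads better is a matter of taste.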