I currently have this dataframe:
id date outcome
3 03/05/2019 no
3 29/05/2019 no
3 04/09/2019 no
3 30/10/2019 yes
3 03/05/2020 no
5 03/12/2019 no
5 26/12/2019 no
5 27/01/2020 yes
5 03/06/2020 yes
6 04/05/2019 no
6 27/10/2019 no
6 26/11/2019 yes
6 28/11/2019 yes
6 29/11/2019 yes
6 13/04/2020 yes
6 14/04/2020 yes
6 24/04/2020 no
6 30/04/2020 no
6 05/05/2020 no
It is grouped based on id and is in ascending order for date.
I want to remove a current row if the row after it has the same outcome. HOWEVER, if an outcome from a row is yes, then the next row must be the FIRST no. This is the desired outcome for the above dataframe:
id date outcome
3 04/09/2019 no
3 30/10/2019 yes
3 03/05/2020 no
5 26/12/2019 no
5 03/06/2020 yes
6 27/10/2019 no
6 14/04/2020 yes
6 24/04/2020 no
At the moment I am doing this:
m1 = (df['outcome'] != df['outcome'].shift()).cumsum()
updated_df = df.groupby([df['id'],m1]).tail(1)
However, this only gives me the last value (yes/no) of a grouped yes/no count. How can I apply a condition in the most pandas way possible?
CodePudding user response:
IIUC, you need two steps. First compute a mask to check whether an outcome is different than the next one (keeping the last), OR follows a yes, everything being done per group. This lead to the filtering you want, except after a yes where you will have a duplicate.(the "after-yes" to keep, and the "last", to discard)
Second, perform again a check of difference of the consecutive outcomes, but keep the first this time.
# step 1
m1 = df['outcome']
m2 = m1.groupby(df['id']).shift(-1)
m3 = m1.groupby(df['id']).shift().eq('yes')&m1.eq('no')
df2 = df[~m1.eq(m2)|m3]
# step 2
m4 = df2['outcome']
m5 = m4.groupby(df['id']).shift()
df2[~m4.eq(m5)]
Output:
id date outcome
2 3 04/09/2019 no
3 3 30/10/2019 yes
4 3 03/05/2020 no
6 5 26/12/2019 no
8 5 03/06/2020 yes
10 6 27/10/2019 no
15 6 14/04/2020 yes
16 6 24/04/2020 no