Home > Mobile >  Drop consecutive duplicate rows based on condition
Drop consecutive duplicate rows based on condition

Time:11-07

I currently have this dataframe:

id  date       outcome
3   03/05/2019  no
3   29/05/2019  no
3   04/09/2019  no
3   30/10/2019  yes
3   03/05/2020  no
5   03/12/2019  no
5   26/12/2019  no
5   27/01/2020  yes
5   03/06/2020  yes
6   04/05/2019  no
6   27/10/2019  no
6   26/11/2019  yes
6   28/11/2019  yes
6   29/11/2019  yes
6   13/04/2020  yes
6   14/04/2020  yes
6   24/04/2020  no
6   30/04/2020  no
6   05/05/2020  no

It is grouped based on id and is in ascending order for date.

I want to remove a current row if the row after it has the same outcome. HOWEVER, if an outcome from a row is yes, then the next row must be the FIRST no. This is the desired outcome for the above dataframe:

id  date       outcome
3   04/09/2019  no
3   30/10/2019  yes
3   03/05/2020  no
5   26/12/2019  no
5   03/06/2020  yes
6   27/10/2019  no
6   14/04/2020  yes
6   24/04/2020  no

At the moment I am doing this:

m1 = (df['outcome'] != df['outcome'].shift()).cumsum()
updated_df = df.groupby([df['id'],m1]).tail(1)

However, this only gives me the last value (yes/no) of a grouped yes/no count. How can I apply a condition in the most pandas way possible?

CodePudding user response:

IIUC, you need two steps. First compute a mask to check whether an outcome is different than the next one (keeping the last), OR follows a yes, everything being done per group. This lead to the filtering you want, except after a yes where you will have a duplicate.(the "after-yes" to keep, and the "last", to discard)

Second, perform again a check of difference of the consecutive outcomes, but keep the first this time.

# step 1
m1 = df['outcome']
m2 = m1.groupby(df['id']).shift(-1)
m3 = m1.groupby(df['id']).shift().eq('yes')&m1.eq('no')

df2 = df[~m1.eq(m2)|m3]

# step 2
m4 = df2['outcome']
m5 = m4.groupby(df['id']).shift()
df2[~m4.eq(m5)]

Output:

    id        date outcome
2    3  04/09/2019      no
3    3  30/10/2019     yes
4    3  03/05/2020      no
6    5  26/12/2019      no
8    5  03/06/2020     yes
10   6  27/10/2019      no
15   6  14/04/2020     yes
16   6  24/04/2020      no
  • Related