Home > database >  How to remove specific rows from a dataframe satisfying multiple conditions (python pandas)?
How to remove specific rows from a dataframe satisfying multiple conditions (python pandas)?

Time:11-17

I have the following DataFrame:

    id outcome
0    3      no
1    3      no
2    3      no
3    3     yes
4    3      no
5    5      no
6    5      no
7    5     yes
8    5     yes
9    6      no
10   6      no
11   6     yes
12   6     yes
13   6     yes
14   6     yes
15   6     yes
16   6      no
17   6      no
18   6      no
19   7      no
20   7      no
21   7     yes
22   7     yes
23   7      no
24   7      no
25   7      no
26   7      yes

It is grouped based on id.

There are a few conditions I need to satisfy.

I need to remove a current row if the row after it has the same outcome.

If a row is 'yes', then the next row must be the FIRST 'no'.

I must also have the LAST 'yes' row in a sequence of 'yes'.

Additionally, I also want to keep the LAST 'no' above a 'yes' (so there could be 2 'no' values above a 'yes': basically in a row of no's the first and last 'no's).

I then need to remove any 'yes' rows that are the last rows.

And the end of this, if an 'id' column has only a 'no' row, then it must be removed too.

This should be the output.

    id outcome

2    3      no
3    3     yes
4    3      no
10   6      no
15   6     yes
16   6      no
20   7      no
22   7     yes
23   7      no
25   7      no
26   7      yes

I am currently doing this:

df = pd.DataFrame(data={'id':[3,3,3,3,3,5,5,5,5,6,6,6,6,6,6,6,6,6,6,7,7,7,7,7], 
     'outcome': ['no','no','no','yes','no','no','no','yes','yes','no','no','yes','yes','yes','yes','yes','no','no','no', 'no', 'yes', 'no', 'no', 'yes']})



# part 1
g = df.groupby('id')['outcome']
m1 = g.shift().eq('yes') | g.shift(-1).eq('yes')
df = df[m1 & df.outcome.ne('yes') | (df.outcome.eq('yes') & g.shift().ne('yes') ) ]

# part 2
# The following removes any last rows that are a 'yes' per id
m2 = df.groupby(['id'])['outcome'].tail(1) != 'no'
df = df.drop(m2[m2].index)

#part 3
# The following removes any id counts that are one, as the last row 'yes' values should be removed, this would mean only 'no' rows are leftover
df_count = df.groupby(['id'])['outcome'].count().to_frame('count').reset_index() 
df = pd.merge(df, df_count[['id','count']] , on=['id'], how='inner') 
df = df.drop((df[df.count == 1].index))

However, part 1 keeps the first 'yes' value row and not the last 'yes' as I need.

I'm also unsure if parts 2 and 3 are unnecessarily verbose and whether I can do something more streamlined to satisfy all the aforementioned conditions.

CodePudding user response:

So I have this to work with for part 1 of my code, but it's not efficient and takes an incredibly long time to run. This is in comparison to the alternative (but slightly incorrect) answers that were quick to execute (seconds).

It's worth mentioning that I am running this code on 70k rows.


index_to_remove = list()
data = df.groupby('id')['outcome'].apply(list).to_dict()

count = 0
for key,value in data.items():
    for i in range(len(value)):
        if i == len(value)-1:
            count =count   1
            continue
        if data[key][i] == "yes" and data[key][i 1] == "yes":
            index_to_remove.append(count)
            count =count   1
            continue
        if i == 0 and data[key][i]=="no" and data[key][i 1] == "no":
            index_to_remove.append(count)
            count =count   1
            continue
        elif data[key][i] == "no" and i == len(value) - 3:
            if data[key][i 1] == "no" and data[key][i 2] == "no":
                index_to_remove.append(count)
                count =count   1
                continue
                
        elif data[key][i] == "no" and i == len(value)-2:
            if data[key][i 1] == "no":
                index_to_remove.append(count)
                count =count   1
                continue
        elif data[key][i] == "no" and i == len(value)-1:
            count =count   1
            continue
        elif data[key][i] == "no" and data[key][i 1] == "no":
            if data[key][i-1] == "no" or data[key][i 2] == "no":
                index_to_remove.append(count)
                count =count   1
                continue
            
        
        count = count   1

for index in index_to_remove:
    df = df.drop(index)

print(df)

  • Related