Pandas Dataframe - Drop certain rows if values match those of other rows (with performance in mind)

Time:07-16

Set-up:

I have a dataframe with multiple columns that I need to consider for each row of data. I want to drop all rows from this dataframe that are marked "FALSE" in the "Valid" column.

At the same time I also wish to remove any other rows that are associated with those dropped rows if they share the same "Start" and "ID" values.

I'm also constrained to avoid heavy looping. I need to process 300k rows; I did implement a solution with a nested for loop, but Python loops are slow and it takes over a day to complete on my machine.

Example:

      Start     ID    Valid
0     10        0     FALSE  #(Should be dropped: it's labelled "FALSE")
1     10        0     TRUE   #(Should be dropped: although it's labelled "TRUE", it shares the same "Start" and "ID" values as row 0)
2     10        1     TRUE   #(Should be kept: it has a different "ID" value)
3     12        2     TRUE   #(Should be kept: it's labelled "TRUE" and doesn't share the same "Start" and "ID" values as any "FALSE" row)

Expected Outcome:

      Start     ID    Valid 
2     10        1     TRUE   
3     12        2     TRUE   
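The example above can be rebuilt as a small DataFrame for experimentation. This sketch assumes the "Valid" column holds actual booleans rather than the strings "FALSE"/"TRUE"; if it holds strings, convert it first (e.g. `df["Valid"] == "TRUE"`).

```python
import pandas as pd

# Reconstruction of the example data, with Valid as booleans (assumed dtype)
df = pd.DataFrame({
    "Start": [10, 10, 10, 12],
    "ID":    [0, 0, 1, 2],
    "Valid": [False, True, True, True],
})
print(df)
```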

Code:

After realizing looping was too inefficient, I found the "isin" function, which is much faster. It didn't take me long to figure out how to remove rows that are not "Valid" and share the same "Start" time, but my method is too aggressive: it removes rows even when they have a different "ID", and I'm not sure how to perform that check efficiently using "isin" or something similarly faster than a loop.

df['drop'] = False  # new column to keep track of what I want to drop
df['drop'] = df["Start"].isin(df.loc[~df.Valid, "Start"])
df = df.loc[~df['drop']]

Current Outcome:

      Start     ID    Valid    
3     12        2     TRUE   
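The over-dropping can be reproduced with a self-contained sketch (assuming boolean "Valid" values): the mask flags every row whose "Start" appears in any invalid row, so "ID" is never consulted and row 2 is lost along with rows 0 and 1.

```python
import pandas as pd

df = pd.DataFrame({
    "Start": [10, 10, 10, 12],
    "ID":    [0, 0, 1, 2],
    "Valid": [False, True, True, True],
})

# Flag every row whose "Start" occurs in any invalid row.
# This ignores "ID", so row 2 (Start=10, ID=1) is dropped too.
mask = df["Start"].isin(df.loc[~df["Valid"], "Start"])
print(df.loc[~mask])  # only row 3 survives
```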

What would be a potential solution that wouldn't involve a loop/significantly affect performance?

CodePudding user response:

Try this:

df.loc[~(df.duplicated(subset=['Start', 'ID'], keep=False)) & (df.Valid)]

Output:

   Start  ID  Valid
2     10   1   True
3     12   2   True
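For reference, a self-contained run of this answer on the example data (with "Valid" rebuilt as booleans, which is an assumption about the real dtype):

```python
import pandas as pd

df = pd.DataFrame({
    "Start": [10, 10, 10, 12],
    "ID":    [0, 0, 1, 2],
    "Valid": [False, True, True, True],
})

# Keep rows that are valid AND whose (Start, ID) pair occurs only once
result = df.loc[~df.duplicated(subset=["Start", "ID"], keep=False) & df["Valid"]]
print(result)
```

One caveat: this filter keeps only (Start, ID) pairs that occur exactly once, so a pair that appears several times with every row valid would also be dropped. That situation doesn't occur in the example data.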

CodePudding user response:

Try this:

df[df.groupby(['Start', 'ID'])['Valid'].transform('all')]

Output:

   Start  ID  Valid
2     10   1   True
3     12   2   True
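A self-contained run of the groupby approach (again assuming boolean "Valid" values): `transform('all')` broadcasts each group's result back to every row, so a row survives only if every row in its (Start, ID) group is valid.

```python
import pandas as pd

df = pd.DataFrame({
    "Start": [10, 10, 10, 12],
    "ID":    [0, 0, 1, 2],
    "Valid": [False, True, True, True],
})

# True for a row only if ALL rows in its (Start, ID) group are valid
mask = df.groupby(["Start", "ID"])["Valid"].transform("all")
print(df[mask])
```

Unlike the duplicated-based answer, this keeps a repeated (Start, ID) pair as long as every one of its rows is valid, which matches the stated requirement directly.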