Home > Software engineering >  Remove duplicate rows but with condition
Remove duplicate rows but with condition

Time:02-15

I have a data frame that looks something like:

df =

date         col1    col2    col3    col4
-----------------------------------------
2022/30/01   2       2       4       5
2022/30/01   2       2       4       5
2022/30/01   0       0       1       2
2022/30/01   0       0       1       2
2022/30/01   3       2       4       2
2022/30/01   5       8       4       3

So basically I have the first two rows being identical, the next two rows also being identical, and the last two different.

What I would like to do is to remove duplicate rows, but only those rows where col1 and col2 equals 0, i.e. the resulting data frame should be:

df_final =

date         col1    col2    col3    col4
-----------------------------------------
2022/30/01   2       2       4       5
2022/30/01   2       2       4       5
2022/30/01   0       0       1       2
2022/30/01   3       2       4       2
2022/30/01   5       8       4       3

Is there any way to accomplish this in an easy way ? I know I could probably do some kind of sorting on the data frame, and then loop through each row and check for conditions. I just suspect that could be a rather time consuming process if there are a lot of rows.

CodePudding user response:

IIUC, this is a simple selection by boolean masks using duplicated to find the duplicated rows and ne all to filter the 0 values:

# is the row not a duplicate?
mask1 = ~df.duplicated()
# are col1 and col2 not both 0?
mask2 = df[['col1', 'col2']].ne(0).all(axis=1)
# then keep the data on either of the above conditions
df2 = df[mask1|mask2]

output:

         date  col1  col2  col3  col4
0  2022/30/01     2     2     4     5
1  2022/30/01     2     2     4     5
2  2022/30/01     0     0     1     2
4  2022/30/01     3     2     4     2
5  2022/30/01     5     8     4     3

Alternative with the reverse operations (eq all)

# Is the row duplicated?
m1 = df.duplicated()
# Are both cols equal to 0?
m2 = df[['col1', 'col2']].eq(0).all(1)
# then keep if not both conditions are met
df[~(m1&m2)]

CodePudding user response:

You can filter in boolean indexing by chain 2 conditions - find rows without 0 rows and then add first duplicated rows by DataFrame.duplicated, by default are tested all columns, for chain is used | for bitwise OR:

df_final = df[df[['col1','col2']].ne(0).all(axis=1) | ~df.duplicated()]
print (df_final)
         date  col1  col2  col3  col4
0  2022/30/01     2     2     4     5
1  2022/30/01     2     2     4     5
2  2022/30/01     0     0     1     2
4  2022/30/01     3     2     4     2
5  2022/30/01     5     8     4     3

Details:

print (df[['col1','col2']].ne(0).all(axis=1))
0     True
1     True
2    False
3    False
4     True
5     True
dtype: bool

print (~df.duplicated())
0     True
1    False
2     True
3    False
4     True
5     True
dtype: bool
  • Related