I have a data frame that looks something like:
df =
date col1 col2 col3 col4
-----------------------------------------
2022/30/01 2 2 4 5
2022/30/01 2 2 4 5
2022/30/01 0 0 1 2
2022/30/01 0 0 1 2
2022/30/01 3 2 4 2
2022/30/01 5 8 4 3
So basically I have the first two rows being identical, the next two rows also being identical, and the last two different.
What I would like to do is to remove duplicate rows, but only those rows where col1
and col2
equals 0, i.e. the resulting data frame should be:
df_final =
date col1 col2 col3 col4
-----------------------------------------
2022/30/01 2 2 4 5
2022/30/01 2 2 4 5
2022/30/01 0 0 1 2
2022/30/01 3 2 4 2
2022/30/01 5 8 4 3
Is there any way to accomplish this in an easy way ? I know I could probably do some kind of sorting on the data frame, and then loop through each row and check for conditions. I just suspect that could be a rather time consuming process if there are a lot of rows.
CodePudding user response:
IIUC, this is a simple selection by boolean masks using duplicated
to find the duplicated rows and ne
all
to filter the 0 values:
# is the row not a duplicate?
mask1 = ~df.duplicated()
# are col1 and col2 not both 0?
mask2 = df[['col1', 'col2']].ne(0).all(axis=1)
# then keep the data on either of the above conditions
df2 = df[mask1|mask2]
output:
date col1 col2 col3 col4
0 2022/30/01 2 2 4 5
1 2022/30/01 2 2 4 5
2 2022/30/01 0 0 1 2
4 2022/30/01 3 2 4 2
5 2022/30/01 5 8 4 3
Alternative with the reverse operations (eq
all
)
# Is the row duplicated?
m1 = df.duplicated()
# Are both cols equal to 0?
m2 = df[['col1', 'col2']].eq(0).all(1)
# then keep if not both conditions are met
df[~(m1&m2)]
CodePudding user response:
You can filter in boolean indexing
by chain 2 conditions - find rows without 0
rows and then add first duplicated rows by DataFrame.duplicated
, by default are tested all columns, for chain is used |
for bitwise OR
:
df_final = df[df[['col1','col2']].ne(0).all(axis=1) | ~df.duplicated()]
print (df_final)
date col1 col2 col3 col4
0 2022/30/01 2 2 4 5
1 2022/30/01 2 2 4 5
2 2022/30/01 0 0 1 2
4 2022/30/01 3 2 4 2
5 2022/30/01 5 8 4 3
Details:
print (df[['col1','col2']].ne(0).all(axis=1))
0 True
1 True
2 False
3 False
4 True
5 True
dtype: bool
print (~df.duplicated())
0 True
1 False
2 True
3 False
4 True
5 True
dtype: bool