Home > database >  Conditionally dropping duplicates using pandas
Conditionally dropping duplicates using pandas

Time:05-19

I have a pandas dataframe that looks as such:

header1 header2 header3
val1 1 True
val2 2 True
val1 1 True
val2 2 False

and I want to conditionally deduplicate the rows, based upon if the last column is True or not.

In the above table I want the output dataframe after drops to look like this:

header1 header2 header3
val1 1 True
val2 2 True
val2 2 False

having deduplicated the third row, as it had header 3 set to true, but not the fourth row, as it had header 3 set to false.

I haven't yet been able to figure out how to do this, as drop_duplicates doesn't support this explicitly and it's not generally good practice to iterate through an entire dataframe (although I will if I have to).

CodePudding user response:

Use:

In [606]: ix = df[df['header3']].drop_duplicates().index
In [610]: df = pd.concat([df.loc[ix], df[~df['header3']]])

In [611]: df
Out[611]: 
  header1  header2  header3
0    val1        1     True
1    val2        2     True
3    val2        2    False

CodePudding user response:

Assuming you want to keep all duplicated rows if header is False, this should work

# flag the row if header3 is not False or duplicated
df = df[~df['header3'] | ~df.drop(columns='header3').duplicated()]
print(df)
  header1  header2  header3
0    val1        1     True
1    val2        2     True
3    val2        2    False
  • Related