I have a pandas dataframe that looks as such:
header1 | header2 | header3 |
---|---|---|
val1 | 1 | True |
val2 | 2 | True |
val1 | 1 | True |
val2 | 2 | False |
and I want to conditionally deduplicate the rows, based upon if the last column is True
or not.
In the above table I want the output dataframe after drops to look like this:
header1 | header2 | header3 |
---|---|---|
val1 | 1 | True |
val2 | 2 | True |
val2 | 2 | False |
having deduplicated the third row, as it had header 3 set to true, but not the fourth row, as it had header 3 set to false.
I haven't yet been able to figure out how to do this, as drop_duplicates
doesn't support this explicitly and it's not generally good practice to iterate through an entire dataframe (although I will if I have to).
CodePudding user response:
Use:
In [606]: ix = df[df['header3']].drop_duplicates().index
In [610]: df = pd.concat([df.loc[ix], df[~df['header3']]])
In [611]: df
Out[611]:
header1 header2 header3
0 val1 1 True
1 val2 2 True
3 val2 2 False
CodePudding user response:
Assuming you want to keep all duplicated rows if header is False, this should work
# flag the row if header3 is not False or duplicated
df = df[~df['header3'] | ~df.drop(columns='header3').duplicated()]
print(df)
header1 header2 header3
0 val1 1 True
1 val2 2 True
3 val2 2 False