When removing duplicates, can I keep those rows that match a condition? Instead of doing:
df.drop_duplicates(subset=['x','y'], keep='first')
do:
df.drop_duplicates(subset=['x','y'], keep=df.loc[df[column]=='String'])
Suppose I have a df like:
A B
1 'Hi'
1 'Bye'
I want to keep the rows with 'Hi'. Doing it this way would be handier, since I am going to introduce multiple conditions in the process.
CodePudding user response:
Use DataFrame.duplicated, invert the resulting mask with ~, and chain it with the condition using & (bitwise AND):
df['mask'] = ~df.duplicated(subset=['A','B']) & (df['B']=='Hi')
print (df)
   A    B   mask
0  1   Hi   True
1  1  Bye  False
2  1   Hi  False
3  1  Bye  False
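For reference, here is a minimal self-contained sketch of the same idea. It assumes the four-row frame implied by the printed output (the question itself only shows two rows) and uses the mask to select rows rather than storing it as a column. The variable names mask and result are illustrative only.

```python
import pandas as pd

# Assumed sample data: the four rows implied by the printed output above.
df = pd.DataFrame({"A": [1, 1, 1, 1], "B": ["Hi", "Bye", "Hi", "Bye"]})

# True only for the first occurrence of each (A, B) pair that also has B == 'Hi'.
mask = ~df.duplicated(subset=["A", "B"]) & (df["B"] == "Hi")

# Boolean indexing keeps only the rows where the mask is True.
result = df[mask]
print(result)
```

Selecting with df[mask] leaves the original frame untouched, so the mask can be inspected or adjusted before any rows are actually dropped.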
Tested with a duplicated index, and it works as expected:
df.index = [0] * 4
df['mask'] = ~df.duplicated(subset=['A','B']) & (df['B']=='Hi')
print (df)
   A    B   mask
0  1   Hi   True
0  1  Bye  False
0  1   Hi  False
0  1  Bye  False
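Since the question mentions adding multiple conditions later, the sketch below shows one way the mask could be extended. Column C, its values, and the threshold are hypothetical and only serve as illustration. It also contrasts the mask approach with a different technique, filtering first and then calling drop_duplicates, because the two can give different results when the first occurrence of a pair fails the condition.

```python
import pandas as pd

# Hypothetical data: column C and the second condition are illustrative only.
df = pd.DataFrame({
    "A": [1, 1, 2, 2, 2],
    "B": ["Hi", "Bye", "Bye", "Hi", "Hi"],
    "C": [10, 20, 30, 5, 40],
})

# Parenthesize each comparison, because & binds more tightly than == and >.
cond = (df["B"] == "Hi") & (df["C"] > 7)

# Mask approach: duplicates are flagged on the full frame, then the condition is applied.
# Row 4 satisfies cond but is dropped, because row 3 is the first (2, 'Hi') pair.
first_then_cond = df[~df.duplicated(subset=["A", "B"]) & cond]

# Alternative: filter by the condition first, then de-duplicate what is left.
cond_then_first = df[cond].drop_duplicates(subset=["A", "B"])

print(first_then_cond)
print(cond_then_first)
```

Which variant fits depends on whether a duplicate that fails the condition should still count as the "first" row of its group.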