Home > Software engineering >  Removing duplicates based on matching column values with boolean indexing
Removing duplicates based on matching column values with boolean indexing

Time:03-30

After merging two DF's I have the following dataset:

DB_ID x_val y_val
x01 405 407
x01 405 405
x02 308 306
x02 308 308
x03 658 658
x03 658 660
x04 None 658
x04 None 660
x05 658 660
x06 660 660

The y table contains multiple values for the left join variable (not included in table), resulting in multiple rows per unique DB_ID (string variable, not in df index).

The issue is that only one row is correct, where x_val and y_val match. I tried removing the duplicates with the following code: df= df[~df['DB_ID'].duplicated() | combined['x_val'] != combined['y_val']]

This however doesn't work. I am looking for a solution to achieve the following result:

DB_ID x_val y_val
x01 405 405
x02 308 308
x03 658 658
x04 None 658
x05 658 660
x06 660 660

CodePudding user response:

Idea is compare both column for not equal, then sorting and reove duplicates by DB_ID:

df = (df.assign(new = df['x_val'].ne(df['y_val']))
        .sort_values(['DB_ID','new'])
        .drop_duplicates('DB_ID')
        .drop('new', axis=1))
print (df)
  DB_ID x_val y_val
1   x01   405   405
3   x02   308   308
4   x03   658   658
6   x04  None   658
8   x05   658   660
9   x06   660   660

If need equal NaNs or Nones use:

df = (df.assign(new = df['x_val'].fillna('same').ne(df['y_val'].fillna('same')))
        .sort_values(['DB_ID','new'])
        .drop_duplicates('DB_ID')
        .drop('new', axis=1))

CodePudding user response:

Maybe, you can simply use:

df = df[df['x_val'] == df['y_val']]
print(df)

# Output
  DB_ID  x_val  y_val
1   x01    405    405
3   x02    308    308
4   x03    658    658

I think you don't need drop_duplicates or duplicated but if you want to ensure there remains only one instance of each DB_ID, you can append .drop_duplicates('DB_ID')

df = df[df['x_val'] == df['y_val']].drop_duplicates('DB_ID')
print(df)

# Output
  DB_ID  x_val  y_val
1   x01    405    405
3   x02    308    308
4   x03    658    658
  • Related