Removing duplicates based on matching column values with boolean indexing-CodePudding

After merging two DF's I have the following dataset:

DB_ID	x_val	y_val
x01	405	407
x01	405	405
x02	308	306
x02	308	308
x03	658	658
x03	658	660
x04	None	658
x04	None	660
x05	658	660
x06	660	660

The y table contains multiple values for the left join variable (not included in table), resulting in multiple rows per unique DB_ID (string variable, not in df index).

The issue is that only one row is correct, where x_val and y_val match. I tried removing the duplicates with the following code: df= df[~df['DB_ID'].duplicated() | combined['x_val'] != combined['y_val']]

This however doesn't work. I am looking for a solution to achieve the following result:

DB_ID	x_val	y_val
x01	405	405
x02	308	308
x03	658	658
x04	None	658
x05	658	660
x06	660	660

CodePudding user response：

Idea is compare both column for not equal, then sorting and reove duplicates by DB_ID:

df = (df.assign(new = df['x_val'].ne(df['y_val']))
        .sort_values(['DB_ID','new'])
        .drop_duplicates('DB_ID')
        .drop('new', axis=1))
print (df)
  DB_ID x_val y_val
1   x01   405   405
3   x02   308   308
4   x03   658   658
6   x04  None   658
8   x05   658   660
9   x06   660   660

If need equal NaNs or Nones use:

df = (df.assign(new = df['x_val'].fillna('same').ne(df['y_val'].fillna('same')))
        .sort_values(['DB_ID','new'])
        .drop_duplicates('DB_ID')
        .drop('new', axis=1))

CodePudding user response：

Maybe, you can simply use:

df = df[df['x_val'] == df['y_val']]
print(df)

# Output
  DB_ID  x_val  y_val
1   x01    405    405
3   x02    308    308
4   x03    658    658

I think you don't need drop_duplicates or duplicated but if you want to ensure there remains only one instance of each DB_ID, you can append .drop_duplicates('DB_ID')

df = df[df['x_val'] == df['y_val']].drop_duplicates('DB_ID')
print(df)

# Output
  DB_ID  x_val  y_val
1   x01    405    405
3   x02    308    308
4   x03    658    658