After merging two DataFrames, I have the following dataset:
DB_ID | x_val | y_val |
---|---|---|
x01 | 405 | 407 |
x01 | 405 | 405 |
x02 | 308 | 306 |
x02 | 308 | 308 |
x03 | 658 | 658 |
x03 | 658 | 660 |
x04 | None | 658 |
x04 | None | 660 |
x05 | 658 | 660 |
x06 | 660 | 660 |
The y table contains multiple values for the left-join key (not shown in the table), which results in multiple rows per unique DB_ID (a string column, not in the df index).
The issue is that only one row is correct: the one where x_val and y_val match. I tried removing the duplicates with the following code:
df= df[~df['DB_ID'].duplicated() | combined['x_val'] != combined['y_val']]
This, however, doesn't work. I am looking for a solution that achieves the following result:
DB_ID | x_val | y_val |
---|---|---|
x01 | 405 | 405 |
x02 | 308 | 308 |
x03 | 658 | 658 |
x04 | None | 658 |
x05 | 658 | 660 |
x06 | 660 | 660 |
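For reference, a minimal frame matching the merged data above can be rebuilt like this (a sketch; note that with None present, pandas will typically store x_val as float with NaN rather than keep None):
import pandas as pd

df = pd.DataFrame({
    'DB_ID': ['x01', 'x01', 'x02', 'x02', 'x03', 'x03', 'x04', 'x04', 'x05', 'x06'],
    'x_val': [405, 405, 308, 308, 658, 658, None, None, 658, 660],
    'y_val': [407, 405, 306, 308, 658, 660, 658, 660, 660, 660],
})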
CodePudding user response:
The idea is to compare both columns for inequality, then sort and remove duplicates by DB_ID:
df = (df.assign(new=df['x_val'].ne(df['y_val']))   # True where x_val and y_val differ
        .sort_values(['DB_ID', 'new'])             # matching rows (False) sort first per DB_ID
        .drop_duplicates('DB_ID')                  # keep the first row of each DB_ID
        .drop('new', axis=1))                      # remove the helper column
print(df)
DB_ID x_val y_val
1 x01 405 405
3 x02 308 308
4 x03 658 658
6 x04 None 658
8 x05 658 660
9 x06 660 660
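Why the sort step matters: ne flags the mismatching rows with True, False sorts before True, and drop_duplicates keeps the first row per DB_ID by default, so a matching row survives whenever one exists. A quick look at that ordering, as a sketch run against the original (pre-deduplication) frame:
# The helper column is False where x_val equals y_val and True otherwise;
# sorting on it pushes the matching row to the top of each DB_ID group.
order = (df.assign(new=df['x_val'].ne(df['y_val']))
           .sort_values(['DB_ID', 'new']))
print(order[order['DB_ID'] == 'x01'])   # the 405 / 405 row now comes first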
If you need NaNs or Nones to compare as equal, use:
df = (df.assign(new=df['x_val'].fillna('same').ne(df['y_val'].fillna('same')))
        .sort_values(['DB_ID', 'new'])
        .drop_duplicates('DB_ID')
        .drop('new', axis=1))
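The fillna is needed because NaN never compares equal to NaN, so without it a DB_ID whose values are missing on both sides would be flagged as a mismatch. A minimal illustration (a sketch):
import numpy as np
import pandas as pd

s = pd.Series([np.nan])
print(s.ne(s))                                # True: NaN != NaN
print(s.fillna('same').ne(s.fillna('same')))  # False once both sides are filled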
CodePudding user response:
Maybe you can simply use:
df = df[df['x_val'] == df['y_val']]
print(df)
# Output
DB_ID x_val y_val
1 x01 405 405
3 x02 308 308
4 x03 658 658
I think you don't need drop_duplicates or duplicated, but if you want to ensure there remains only one instance of each DB_ID, you can append .drop_duplicates('DB_ID'):
df = df[df['x_val'] == df['y_val']].drop_duplicates('DB_ID')
print(df)
# Output
DB_ID x_val y_val
1 x01 405 405
3 x02 308 308
4 x03 658 658