I have two DataFrames with similar columns; the one column they have in common is column A. I have the left df L and the right df R. I want to make sure that any duplicates for column A seen in L that are also in df R are removed from L, the whole row. How would one do this?
import pandas as pd

L_df = pd.DataFrame({'A': ['bob/is/cool', 'alice/is/cool', 'jim/is/cool'],
                     'view': ['A', 'B', 'B']})
R_df = pd.DataFrame({'A': ['ralf/is/cool', 'i/am/cool', 'alice/is/cool'],
                     'view': ['A', 'B', 'C']})
I want to get the result of this, with the result taking away duplicates for column A and dropping the duplicated row from L, not R. So we keep alice/is/cool with a view value of C and not B, if that makes sense :)
Output would be
out = pd.DataFrame({'A': ['ralf/is/cool', 'i/am/cool', 'alice/is/cool', 'bob/is/cool', 'jim/is/cool'],
                    'view': ['A', 'B', 'C', 'A', 'B']})
CodePudding user response:
Would this be what you are after?
>>> pd.concat([R_df, L_df]).drop_duplicates(keep='first', subset='A')
               A view
0   ralf/is/cool    A
1      i/am/cool    B
2  alice/is/cool    C
0    bob/is/cool    A
2    jim/is/cool    B
Note: this is a wild guess based on your description.
It will indiscriminately remove any duplicates (within R, within L, or in the concatenation of both) and keep just the first one.
You may want more subtle handling depending on where and how many duplicates you have, but it's hard to tell without a richer set of examples.
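If the rule is strictly "drop from L any row whose A value also appears in R, then stack R and the remaining L rows", here is a more targeted sketch. It reuses L_df and R_df from above; isin and concat are standard pandas, but the exact rule is my reading of the question:

# Keep only the L rows whose A value does not occur anywhere in R
L_filtered = L_df[~L_df['A'].isin(R_df['A'])]

# Stack R on top of what is left of L; ignore_index gives a clean 0..n index
out = pd.concat([R_df, L_filtered], ignore_index=True)
print(out)
#                A view
# 0   ralf/is/cool    A
# 1      i/am/cool    B
# 2  alice/is/cool    C
# 3    bob/is/cool    A
# 4    jim/is/cool    B

Unlike the drop_duplicates version, this only removes L rows that collide with R; duplicates living purely inside L (or purely inside R) are left alone.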