I have huge data and I want merge it on a specific column where values from df1 is not available in df2 and vice versa.
Example:
df1:
Domain Sales
google.com 100
facebook.com 200
youtube.com 300
df2:
Domain Sales
google.com 100
yahoo.com 200
youtube.com 300
Required output:
Domain Sales
facebook.com 200
yahoo.com 200
I have tried:
df = pd.merge(df1, df2, on="Domain", how="outer")
and all the other values for the how
parameter, but it does not give me the required output. How can I achieve the required output?
CodePudding user response:
pd.concat([df1, df2]).drop_duplicates('Domain', keep=False)
CodePudding user response:
As you have a huge dataset you can use set operations to pre-filter the data before concat
:
S = set(df1['Domain']).symmetric_difference(df2['Domain'])
out = pd.concat([d.loc[lambda d: d['Domain'].isin(S)] for d in [df1, df2]])
output:
Domain Sales
1 facebook.com 200
1 yahoo.com 200
CodePudding user response:
We still can try with merge
df = df1.merge(df2,on = 'Domain', indicator = True, how='outer').query('_merge!="both"')
df['Sales'] = df.pop('Sales_x').fillna(df.pop('Sales_y'))
df
Out[342]:
Domain _merge Sales
1 facebook.com left_only 200.0
3 yahoo.com right_only 200.0