I have a situation where I am trying to join `df_a` to `df_b`. In reality, these dataframes have shapes (389944, 121) and (1098118, 60).
I need to conditionally join the two dataframes if any of the conditions below is true (they combine into a single boolean OR, sketched just after the list). If more than one condition matches, a row only needs to be joined once:
- `df_a.player == df_b.handle`
- `df_a.website == df_b.url`
- `df_a.website == df_b.web_addr`
- `df_a.website == df_b.notes`
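Spelled out as code (just a sketch using two sample rows lifted from the tables below, not code from my actual pipeline), the combined condition for a pair of rows is a single boolean OR:

```python
# hypothetical single rows taken from df_a and df_b, only to illustrate the OR
a = {'player': 'Kobe Bryant', 'website': 'www.mamba.com'}
b = {'handle': 'Kobe Bryant', 'url': 'https://twitter.com/kobebryant',
     'web_addr': 'https://granitystudios.com/', 'notes': 'https://granitystudios.com/'}

match = (
    a['player'] == b['handle']
    or a['website'] == b['url']
    or a['website'] == b['web_addr']
    or a['website'] == b['notes']
)
print(match)  # True: the handle condition alone is enough to join this pair
```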
For an example...
df_a:
player | website | merch |
---|---|---|
michael jordan | www.michaeljordan.com | Y |
Lebron James | www.kingjames.com | Y |
Kobe Bryant | www.mamba.com | Y |
Larry Bird | www.larrybird.com | Y |
luka Doncic | www.77.com | N |
df_b:
platform | url | web_addr | notes | handle | followers | following |
---|---|---|---|---|---|---|
Twitter | https://twitter.com/luka7doncic | www.77.com | | luka7doncic | 1500000 | 347 |
Twitter | www.larrybird.com | https://en.wikipedia.org/wiki/Larry_Bird | www.larrybird.com | nh | 0 | 0 |
Twitter | | https://www.michaeljordansworld.com/ | www.michaeljordan.com | nh | 0 | 0 |
Twitter | https://twitter.com/kobebryant | https://granitystudios.com/ | https://granitystudios.com/ | Kobe Bryant | 14900000 | 514 |
Twitter | fooman.com | thefoo.com | foobar | foobarman | 1 | 1 |
Twitter | www.stackoverflow.com | | | nh | 0 | 0 |
Ideally, `df_a` gets left joined to `df_b` to bring in the `handle`, `followers`, and `following` fields:
player | website | merch | handle | followers | following |
---|---|---|---|---|---|
michael jordan | www.michaeljordan.com | Y | nh | 0 | 0 |
Lebron James | www.kingjames.com | Y | null | null | null |
Kobe Bryant | www.mamba.com | Y | Kobe Bryant | 14900000 | 514 |
Larry Bird | www.larrybird.com | Y | nh | 0 | 0 |
luka Doncic | www.77.com | N | luka7doncic | 1500000 | 347 |
A minimal, reproducible example is below:
import pandas as pd, numpy as np
df_a = pd.DataFrame.from_dict({'player': {0: 'michael jordan', 1: 'Lebron James', 2: 'Kobe Bryant', 3: 'Larry Bird', 4: 'luka Doncic'}, 'website': {0: 'www.michaeljordan.com', 1: 'www.kingjames.com', 2: 'www.mamba.com', 3: 'www.larrybird.com', 4: 'www.77.com'}, 'merch': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'N'}, 'handle': {0: 'nh', 1: np.nan, 2: 'Kobe Bryant', 3: 'nh', 4: 'luka7doncic'}, 'followers': {0: 0.0, 1: np.nan, 2: 14900000.0, 3: 0.0, 4: 1500000.0}, 'following': {0: 0.0, 1: np.nan, 2: 514.0, 3: 0.0, 4: 347.0}})
df_b = pd.DataFrame.from_dict({'platform': {0: 'Twitter', 1: 'Twitter', 2: 'Twitter', 3: 'Twitter', 4: 'Twitter', 5: 'Twitter'}, 'url': {0: 'https://twitter.com/luka7doncic', 1: 'www.larrybird.com', 2: np.nan, 3: 'https://twitter.com/kobebryant', 4: 'fooman.com', 5: 'www.stackoverflow.com'}, 'web_addr': {0: 'www.77.com', 1: 'https://en.wikipedia.org/wiki/Larry_Bird', 2: 'https://www.michaeljordansworld.com/', 3: 'https://granitystudios.com/', 4: 'thefoo.com', 5: np.nan}, 'notes': {0: np.nan, 1: 'www.larrybird.com', 2: 'www.michaeljordan.com', 3: 'https://granitystudios.com/', 4: 'foobar', 5: np.nan}, 'handle': {0: 'luka7doncic', 1: 'nh', 2: 'nh', 3: 'Kobe Bryant', 4: 'foobarman', 5: 'nh'}, 'followers': {0: 1500000, 1: 0, 2: 0, 3: 14900000, 4: 1, 5: 0}, 'following': {0: 347, 1: 0, 2: 0, 3: 514, 4: 1, 5: 0}})
cols_to_join = ['url', 'web_addr', 'notes']
on_handle = df_a.merge(right=df_b, left_on='player', right_on='handle', how='left')
res_df = []
res_df.append(on_handle)
for right_col in cols_to_join:
    try:
        temp = df_a.merge(right=df_b, left_on='website', right_on=right_col, how='left')
    except:
        temp = None
    if temp is not None:
        res_df.append(temp)
final = pd.concat(res_df, ignore_index=True)
final.drop_duplicates(inplace=True)
final
However, this produces erroneous results with duplicate columns.
How can I do this more efficiently and with correct results?
CodePudding user response:
Use:
#for the same input, drop the expected columns first
df_a = df_a.drop(['handle','followers','following'], axis=1)
# print (df_a)
#melt df_b so the url/web_addr/notes columns are stacked into one 'website' column
cols_to_join = ['url', 'web_addr', 'notes']
df2 = df_b.melt(id_vars=df_b.columns.difference(cols_to_join), value_name='website')
#the melt creates duplicate website values, so keep one row per website
#(sorting by followers first keeps the row with the most followers)
df2 = df2.sort_values('followers', ascending=False).drop_duplicates('website')
print (df2)
followers following handle platform variable \
9 14900000 514 Kobe Bryant Twitter web_addr
3 14900000 514 Kobe Bryant Twitter url
6 1500000 347 luka7doncic Twitter web_addr
12 1500000 347 luka7doncic Twitter notes
0 1500000 347 luka7doncic Twitter url
10 1 1 foobarman Twitter web_addr
4 1 1 foobarman Twitter url
16 1 1 foobarman Twitter notes
5 0 0 nh Twitter url
7 0 0 nh Twitter web_addr
8 0 0 nh Twitter web_addr
1 0 0 nh Twitter url
14 0 0 nh Twitter notes
website
9 https://granitystudios.com/
3 https://twitter.com/kobebryant
6 www.77.com
12 NaN
0 https://twitter.com/luka7doncic
10 thefoo.com
4 fooman.com
16 foobar
5 www.stackoverflow.com
7 https://en.wikipedia.org/wiki/Larry_Bird
8 https://www.michaeljordansworld.com/
1 www.larrybird.com
14 www.michaeljordan.com
#merge twice: once on player->handle, once on website; because both results keep
#df_a's index, missing values in the website merge can be filled from the handle merge
dffin1 = df_a.merge(df_b.drop(cols_to_join + ['platform'], axis=1), left_on='player', right_on='handle', how='left')
dffin2 = df_a.merge(df2.drop(['platform','variable'], axis=1), on='website', how='left')
dffin = dffin2.fillna(dffin1)
print (dffin)
player website merch followers following \
0 michael jordan www.michaeljordan.com Y 0.0 0.0
1 Lebron James www.kingjames.com Y NaN NaN
2 Kobe Bryant www.mamba.com Y 14900000.0 514.0
3 Larry Bird www.larrybird.com Y 0.0 0.0
4 luka Doncic www.77.com N 1500000.0 347.0
handle
0 nh
1 NaN
2 Kobe Bryant
3 nh
4 luka7doncic
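If this needs to run repeatedly on the full-sized frames, the same melt / dedupe / double-merge / fillna idea can be wrapped in a helper. The sketch below is only illustrative: the name `coalesce_join` is made up, and it assumes `df_a` has already had the expected `handle`/`followers`/`following` columns dropped (as above) and that no player matches more than one `df_b` row on `handle`.

```python
import pandas as pd

def coalesce_join(df_a, df_b, url_cols=('url', 'web_addr', 'notes'),
                  keep=('handle', 'followers', 'following')):
    """Left-join df_b onto df_a, matching on website OR handle (sketch)."""
    url_cols, keep = list(url_cols), list(keep)
    # stack the candidate URL columns of df_b into a single 'website' column
    long_b = df_b.melt(id_vars=df_b.columns.difference(url_cols),
                       value_name='website')
    # one row per distinct website, preferring the row with the most followers
    long_b = (long_b.sort_values('followers', ascending=False)
                    .drop_duplicates('website'))
    # first match on website, then fill the remaining gaps from a match on handle
    by_site = df_a.merge(long_b[['website'] + keep], on='website', how='left')
    by_handle = df_a.merge(df_b[keep], left_on='player', right_on='handle',
                           how='left')
    return by_site.fillna(by_handle)
```

With the sample data, `coalesce_join(df_a, df_b)` should reproduce the `dffin` frame printed above.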
CodePudding user response:
You can pass lists to `left_on` and `right_on`:
final = df_a.merge(
    right=df_b,
    left_on=['player', 'website', 'website', 'website'],
    right_on=['handle', 'url', 'web_addr', 'notes'],
    how='left'
)
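For reference, a minimal sketch (the tiny `left`/`right` frames here are made up) of how list keys behave in `merge`: pandas treats the lists as one composite key, so a pair of rows joins only when every listed pair of columns is equal.

```python
import pandas as pd

# hypothetical frames, only to show composite-key matching
left = pd.DataFrame({'k1': [1, 1], 'k2': ['a', 'b'], 'x': [10, 20]})
right = pd.DataFrame({'k1': [1, 1], 'k2': ['a', 'c'], 'y': [100, 200]})

# both k1 and k2 must match; the (1, 'b') row gets NaN for y
print(left.merge(right, on=['k1', 'k2'], how='left'))
```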