I have a situation where I am trying to join `df_a` to `df_b`. In reality, these dataframes have shapes (389944, 121) and (1098118, 60).
I need to conditionally join the two dataframes if any of the conditions below is true (they combine into a single boolean OR, sketched just after the list). If more than one condition matches, a row only needs to be joined once:
- `df_a.player == df_b.handle`
- `df_a.website == df_b.url`
- `df_a.website == df_b.web_addr`
- `df_a.website == df_b.notes`
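Spelled out as code (just a sketch using two sample rows lifted from the tables below, not code from my actual pipeline), the combined condition for a pair of rows is a single boolean OR:

```python
# hypothetical single rows taken from df_a and df_b, only to illustrate the OR
a = {'player': 'Kobe Bryant', 'website': 'www.mamba.com'}
b = {'handle': 'Kobe Bryant', 'url': 'https://twitter.com/kobebryant',
     'web_addr': 'https://granitystudios.com/', 'notes': 'https://granitystudios.com/'}

match = (
    a['player'] == b['handle']
    or a['website'] == b['url']
    or a['website'] == b['web_addr']
    or a['website'] == b['notes']
)
print(match)  # True: the handle condition alone is enough to join this pair
```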
For an example...
df_a:
player | website | merch |
---|---|---|
michael jordan | www.michaeljordan.com | Y |
Lebron James | www.kingjames.com | Y |
Kobe Bryant | www.mamba.com | Y |
Larry Bird | www.larrybird.com | Y |
luka Doncic | www.77.com | N |
df_b:
platform | url | web_addr | notes | handle | followers | following |
---|---|---|---|---|---|---|
Twitter | https://twitter.com/luka7doncic | www.77.com | | luka7doncic | 1500000 | 347 |
Twitter | www.larrybird.com | https://en.wikipedia.org/wiki/Larry_Bird | www.larrybird.com | nh | 0 | 0 |
Twitter | | https://www.michaeljordansworld.com/ | www.michaeljordan.com | nh | 0 | 0 |
Twitter | https://twitter.com/kobebryant | https://granitystudios.com/ | https://granitystudios.com/ | Kobe Bryant | 14900000 | 514 |
Twitter | fooman.com | thefoo.com | foobar | foobarman | 1 | 1 |
Twitter | www.stackoverflow.com | | | nh | 0 | 0 |
Ideally, `df_a` gets left joined to `df_b` to bring in the `handle`, `followers`, and `following` fields:
player | website | merch | handle | followers | following |
---|---|---|---|---|---|
michael jordan | www.michaeljordan.com | Y | nh | 0 | 0 |
Lebron James | www.kingjames.com | Y | null | null | null |
Kobe Bryant | www.mamba.com | Y | Kobe Bryant | 14900000 | 514 |
Larry Bird | www.larrybird.com | Y | nh | 0 | 0 |
luka Doncic | www.77.com | N | luka7doncic | 1500000 | 347 |
A minimal, reproducible example is below:
import pandas as pd, numpy as np
df_a = pd.DataFrame.from_dict({'player': {0: 'michael jordan', 1: 'Lebron James', 2: 'Kobe Bryant', 3: 'Larry Bird', 4: 'luka Doncic'}, 'website': {0: 'www.michaeljordan.com', 1: 'www.kingjames.com', 2: 'www.mamba.com', 3: 'www.larrybird.com', 4: 'www.77.com'}, 'merch': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'N'}, 'handle': {0: 'nh', 1: np.nan, 2: 'Kobe Bryant', 3: 'nh', 4: 'luka7doncic'}, 'followers': {0: 0.0, 1: np.nan, 2: 14900000.0, 3: 0.0, 4: 1500000.0}, 'following': {0: 0.0, 1: np.nan, 2: 514.0, 3: 0.0, 4: 347.0}})
df_b = pd.DataFrame.from_dict({'platform': {0: 'Twitter', 1: 'Twitter', 2: 'Twitter', 3: 'Twitter', 4: 'Twitter', 5: 'Twitter'}, 'url': {0: 'https://twitter.com/luka7doncic', 1: 'www.larrybird.com', 2: np.nan, 3: 'https://twitter.com/kobebryant', 4: 'fooman.com', 5: 'www.stackoverflow.com'}, 'web_addr': {0: 'www.77.com', 1: 'https://en.wikipedia.org/wiki/Larry_Bird', 2: 'https://www.michaeljordansworld.com/', 3: 'https://granitystudios.com/', 4: 'thefoo.com', 5: np.nan}, 'notes': {0: np.nan, 1: 'www.larrybird.com', 2: 'www.michaeljordan.com', 3: 'https://granitystudios.com/', 4: 'foobar', 5: np.nan}, 'handle': {0: 'luka7doncic', 1: 'nh', 2: 'nh', 3: 'Kobe Bryant', 4: 'foobarman', 5: 'nh'}, 'followers': {0: 1500000, 1: 0, 2: 0, 3: 14900000, 4: 1, 5: 0}, 'following': {0: 347, 1: 0, 2: 0, 3: 514, 4: 1, 5: 0}})
cols_to_join = ['url', 'web_addr', 'notes']
on_handle = df_a.merge(right=df_b, left_on='player', right_on='handle', how='left')
res_df = []
res_df.append(on_handle)
for right_col in cols_to_join:
    try:
        temp = df_a.merge(right=df_b, left_on='website', right_on=right_col, how='left')
    except:
        temp = None
    if temp is not None:
        res_df.append(temp)
final = pd.concat(res_df, ignore_index=True)
final.drop_duplicates(inplace=True)
final
However, this produces erroneous results with duplicate columns.
How can I do this more efficiently and with correct results?
CodePudding user response:
Use:
#for the same input, drop the expected columns first
df_a = df_a.drop(['handle','followers','following'], axis=1)
# print (df_a)
#melt df_b so the url/web_addr/notes columns are stacked into one 'website' column
cols_to_join = ['url', 'web_addr', 'notes']
df2 = df_b.melt(id_vars=df_b.columns.difference(cols_to_join), value_name='website')
#the melt creates duplicate website values, so keep one row per website
#(sorting by followers first keeps the row with the most followers)
df2 = df2.sort_values('followers', ascending=False).drop_duplicates('website')
print (df2)
followers following handle platform variable \
9 14900000 514 Kobe Bryant Twitter web_addr
3 14900000 514 Kobe Bryant Twitter url
6 1500000 347 luka7doncic Twitter web_addr
12 1500000 347 luka7doncic Twitter notes
0 1500000 347 luka7doncic Twitter url
10 1 1 foobarman Twitter web_addr
4 1 1 foobarman Twitter url
16 1 1 foobarman Twitter notes
5 0 0 nh Twitter url
7 0 0 nh Twitter web_addr
8 0 0 nh Twitter web_addr
1 0 0 nh Twitter url
14 0 0 nh Twitter notes
website
9 https://granitystudios.com/
3 https://twitter.com/kobebryant
6 www.77.com
12 NaN
0 https://twitter.com/luka7doncic
10 thefoo.com
4 fooman.com
16 foobar
5 www.stackoverflow.com
7 https://en.wikipedia.org/wiki/Larry_Bird
8 https://www.michaeljordansworld.com/
1 www.larrybird.com
14 www.michaeljordan.com
#merge twice: once on player->handle, once on website; because both results keep
#df_a's index, missing values in the website merge can be filled from the handle merge
dffin1 = df_a.merge(df_b.drop(cols_to_join + ['platform'], axis=1), left_on='player', right_on='handle', how='left')
dffin2 = df_a.merge(df2.drop(['platform','variable'], axis=1), on='website', how='left')
dffin = dffin2.fillna(dffin1)
print (dffin)
player website merch followers following \
0 michael jordan www.michaeljordan.com Y 0.0 0.0
1 Lebron James www.kingjames.com Y NaN NaN
2 Kobe Bryant www.mamba.com Y 14900000.0 514.0
3 Larry Bird www.larrybird.com Y 0.0 0.0
4 luka Doncic www.77.com N 1500000.0 347.0
handle
0 nh
1 NaN
2 Kobe Bryant
3 nh
4 luka7doncic
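If this needs to run repeatedly on the full-sized frames, the same melt / dedupe / double-merge / fillna idea can be wrapped in a helper. The sketch below is only illustrative: the name `coalesce_join` is made up, and it assumes `df_a` has already had the expected `handle`/`followers`/`following` columns dropped (as above) and that no player matches more than one `df_b` row on `handle`.

```python
import pandas as pd

def coalesce_join(df_a, df_b, url_cols=('url', 'web_addr', 'notes'),
                  keep=('handle', 'followers', 'following')):
    """Left-join df_b onto df_a, matching on website OR handle (sketch)."""
    url_cols, keep = list(url_cols), list(keep)
    # stack the candidate URL columns of df_b into a single 'website' column
    long_b = df_b.melt(id_vars=df_b.columns.difference(url_cols),
                       value_name='website')
    # one row per distinct website, preferring the row with the most followers
    long_b = (long_b.sort_values('followers', ascending=False)
                    .drop_duplicates('website'))
    # first match on website, then fill the remaining gaps from a match on handle
    by_site = df_a.merge(long_b[['website'] + keep], on='website', how='left')
    by_handle = df_a.merge(df_b[keep], left_on='player', right_on='handle',
                           how='left')
    return by_site.fillna(by_handle)
```

With the sample data, `coalesce_join(df_a, df_b)` should reproduce the `dffin` frame printed above.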
CodePudding user response:
You can pass lists to `left_on` and `right_on`:
final = df_a.merge(
    right=df_b,
    left_on=['player', 'website', 'website', 'website'],
    right_on=['handle', 'url', 'web_addr', 'notes'],
    how='left'
)
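For reference, a minimal sketch (the tiny `left`/`right` frames here are made up) of how list keys behave in `merge`: pandas treats the lists as one composite key, so a pair of rows joins only when every listed pair of columns is equal.

```python
import pandas as pd

# hypothetical frames, only to show composite-key matching
left = pd.DataFrame({'k1': [1, 1], 'k2': ['a', 'b'], 'x': [10, 20]})
right = pd.DataFrame({'k1': [1, 1], 'k2': ['a', 'c'], 'y': [100, 200]})

# both k1 and k2 must match; the (1, 'b') row gets NaN for y
print(left.merge(right, on=['k1', 'k2'], how='left'))
```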