Is there any alternative of pandas' merge_asof when joining on multiple columns?-CodePudding

I need to merge 2 data frames on 3 key columns, one contains strings, one contains integers and the last contains floats. On the string and integer columns, I would need an exact match, while on the float column I want to get the values of the right data frame for the nearest corresponding float key in the left data frame. I thought that pandas' merge_asof function would be good, but unfortunately, it can be used only for a single, numerical column to join on.

I have approached this problem with cross merge, but unfortunately, it won't work for the real problem, as each data frame has ~2 million rows. (and it's not giving the correct values either...)

import pandas as pd
import numpy as np
import itertools

df1 = pd.DataFrame(data = itertools.product(['a', 'b', 'c'],
                                            [0, 1, 2, 3, 4],
                                            [0, 0.25, 0.5, 0.75, 1]), 
                   columns = ['key1', 'key2', 'key3'])

df2 = df1.copy()
df2['key3'] = np.random.rand(len(df2))
df2['values'] = np.random.rand(len(df2))

merged = df1.merge(df2, how = 'cross')
merged['key3_diff'] = np.abs(merged['key3_x'] - merged['key3_y'])

aux = merged[['key1_x', 'key2_x', 
              'key3_x', 'key3_diff']].groupby(['key1_x', 
                                               'key2_x', 
                                               'key3_x']).min().reset_index()

res = merged[merged['key1_x'].isin(aux['key1_x']) &
           merged['key2_x'].isin(aux['key2_x']) &
           merged['key3_diff'].isin(aux['key3_diff'])][['key1_x', 
                                                        'key2_x', 
                                                        'key3_x', 
                                                        'values']].rename(columns = {'key1_x': 'key1',
                                                                                     'key2_x': 'key2',
                                                                                     'key3_x': 'key3'})

Is there any alternative besides implementing this calculation sequentially (maybe with numba and jit)?

CodePudding user response：

IIUC, Try something like this:

# Creates a Cartesian on key1, and key2 match all key3 in df1 to 
# all key3 in df2 by key1 and key
dfm = df1.merge(df2, on=['key1', 'key2'])

# Create a measure to find minimum match on  
dfm = dfm.eval('key_diff = abs(key3_y - key3_x)')

# Find isolate minimum match per record in df1 key1, key2 and key3_x.
idx = dfm.groupby(['key1', 'key2', 'key3_x'])['key_diff'].idxmin().to_numpy()

# Filter merged data frame
dfm.reindex(idx)

Output:

    key1  key2  key3_x    key3_y    values  key_diff
3      a     0    0.00  0.155915  0.993490  0.155915
7      a     0    0.25  0.206727  0.620721  0.043273
11     a     0    0.50  0.705011  0.942236  0.205011
19     a     0    0.75  0.705943  0.124545  0.044057
20     a     0    1.00  0.979249  0.089052  0.020751
..   ...   ...     ...       ...       ...       ...
350    c     4    0.00  0.269154  0.437154  0.269154
355    c     4    0.25  0.269154  0.437154  0.019154
362    c     4    0.50  0.714185  0.927274  0.214185
367    c     4    0.75  0.714185  0.927274  0.035815
373    c     4    1.00  0.908999  0.078522  0.091001

[75 rows x 6 columns]

CodePudding user response：

This is a solution using pandas's merge_asof. I didn't realize the 'by' argument. This is equivalent with the accepted answer.

df1 = df1.sort_values(by = 'key3')
df2 = df2.sort_values(by = 'key3')

res3 = pd.merge_asof(df1, df2, by = ['key1', 'key2'], on = 'key3', direction = 'nearest')