I need to merge 2 data frames on 3 key columns, one contains strings, one contains integers and the last contains floats. On the string and integer columns, I would need an exact match, while on the float column I want to get the values of the right data frame for the nearest corresponding float key in the left data frame. I thought that pandas' merge_asof function would be good, but unfortunately, it can be used only for a single, numerical column to join on.
I have approached this problem with cross merge, but unfortunately, it won't work for the real problem, as each data frame has ~2 million rows. (and it's not giving the correct values either...)
import pandas as pd
import numpy as np
import itertools
df1 = pd.DataFrame(data = itertools.product(['a', 'b', 'c'],
[0, 1, 2, 3, 4],
[0, 0.25, 0.5, 0.75, 1]),
columns = ['key1', 'key2', 'key3'])
df2 = df1.copy()
df2['key3'] = np.random.rand(len(df2))
df2['values'] = np.random.rand(len(df2))
merged = df1.merge(df2, how = 'cross')
merged['key3_diff'] = np.abs(merged['key3_x'] - merged['key3_y'])
aux = merged[['key1_x', 'key2_x',
'key3_x', 'key3_diff']].groupby(['key1_x',
'key2_x',
'key3_x']).min().reset_index()
res = merged[merged['key1_x'].isin(aux['key1_x']) &
merged['key2_x'].isin(aux['key2_x']) &
merged['key3_diff'].isin(aux['key3_diff'])][['key1_x',
'key2_x',
'key3_x',
'values']].rename(columns = {'key1_x': 'key1',
'key2_x': 'key2',
'key3_x': 'key3'})
Is there any alternative besides implementing this calculation sequentially (maybe with numba and jit)?
CodePudding user response:
IIUC, Try something like this:
# Creates a Cartesian on key1, and key2 match all key3 in df1 to
# all key3 in df2 by key1 and key
dfm = df1.merge(df2, on=['key1', 'key2'])
# Create a measure to find minimum match on
dfm = dfm.eval('key_diff = abs(key3_y - key3_x)')
# Find isolate minimum match per record in df1 key1, key2 and key3_x.
idx = dfm.groupby(['key1', 'key2', 'key3_x'])['key_diff'].idxmin().to_numpy()
# Filter merged data frame
dfm.reindex(idx)
Output:
key1 key2 key3_x key3_y values key_diff
3 a 0 0.00 0.155915 0.993490 0.155915
7 a 0 0.25 0.206727 0.620721 0.043273
11 a 0 0.50 0.705011 0.942236 0.205011
19 a 0 0.75 0.705943 0.124545 0.044057
20 a 0 1.00 0.979249 0.089052 0.020751
.. ... ... ... ... ... ...
350 c 4 0.00 0.269154 0.437154 0.269154
355 c 4 0.25 0.269154 0.437154 0.019154
362 c 4 0.50 0.714185 0.927274 0.214185
367 c 4 0.75 0.714185 0.927274 0.035815
373 c 4 1.00 0.908999 0.078522 0.091001
[75 rows x 6 columns]
CodePudding user response:
This is a solution using pandas's merge_asof. I didn't realize the 'by' argument. This is equivalent with the accepted answer.
df1 = df1.sort_values(by = 'key3')
df2 = df2.sort_values(by = 'key3')
res3 = pd.merge_asof(df1, df2, by = ['key1', 'key2'], on = 'key3', direction = 'nearest')