Cross Join, Compare Values, and Select Closest Match - More Efficient Way?

I asked a similar question in SQL form here: Cross Join, Compare Values, and Select Closest Match - More Efficient Way?

I have two tables with 3 columns each. I cross join them and take the differences of the values, then find the t2.id whose values are closest to each t1.id.

These tables are quite large (50k rows in t1, 2M rows in t2, and 30 or more columns to compare in the real problem).

What's the most efficient way to write this? It would be nice if each id_y were used at most once, but that's not critical.

import pandas as pd

# sample tables
t1  = {'id':['a1', 'a2', 'a3'], 'val1':[0.11,0.22,0.33], 'val2':[0.44,0.55,0.66]}
t2 = {'id':['b1', 'b2', 'b3'], 'val1':[0.99,0.77,0.55], 'val2':[0.22,0.44,0.66]}

df1 = pd.DataFrame(t1)
df2 = pd.DataFrame(t2)

print(df1)
print(df2)


# cross join
df1['key'] = 0
df2['key'] = 0
df3 = df1.merge(df2, on='key', how='outer')
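# (note: on pandas >= 1.2 the dummy key is unnecessary;
# df3 = df1.merge(df2, how='cross') does the same thing)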

print(df3)

# calculate error
df3['err'] = abs(df3['val1_x']-df3['val1_y']) + abs(df3['val2_x']-df3['val2_y'])

# choose lowest error for each t1.id
# would be nice if each id_y were used at most once
df4 = df3.loc[df3.groupby('id_x')['err'].idxmin()]
df5 = df4[['id_x', 'id_y', 'err']]


print(df5)

CodePudding user response:

It looks like you are trying to compare every row in df1 with every row in df2 and identify the pairing with the lowest total difference. This requires len(df1) * len(df2) * n_cols calculations, which is 3 trillion in your case (50k * 2M * 30).

Even more problematic is memory usage: storing the intermediate result (8 bytes per float) takes 24 trillion bytes, roughly 22 TiB, before you can process it any further.
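For concreteness, a quick back-of-the-envelope check of those numbers (a sketch using the sizes stated in the question):

rows_t1, rows_t2, n_cols = 50_000, 2_000_000, 30

# one absolute difference per (row pair, column)
n_ops = rows_t1 * rows_t2 * n_cols   # 3_000_000_000_000 -> 3 trillion

# 8 bytes per float64 held in memory at once
n_bytes = n_ops * 8
print(n_bytes / 2**40)               # ~21.8 TiB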

The solution below uses numpy broadcasting with chunking, so that memory consumption is reduced to chunk_size * len(df2) * n_cols * 8 bytes. It still performs 3 trillion calculations, so it will take some time despite being vectorized:

import numpy as np

v1 = df1.loc[:, "val1":"val30"].to_numpy()[:, None]
v2 = df2.loc[:, "val1":"val30"].to_numpy()

# My computer with 16GB of RAM can only handle 10 rows of df1 at a time
# Adjust this number according to your PC's memory
chunk_size = 10
closest = np.zeros(len(df1), dtype=int)  # integer positions into df2

for i in range(0, len(df1), chunk_size):
    chunk = slice(i, i + chunk_size)
    closest[chunk] = np.abs(v1[chunk] - v2).sum(axis=-1).argmin(axis=-1)

result = pd.DataFrame({
    "id_x": df1["id"],
    "id_y": df2["id"].iloc[closest].to_list()
})
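
To sanity-check the sketch on the toy frames from the question, narrow the column slice to the two sample columns (val1:val2 instead of val1:val30; the toy data is small enough to skip chunking):

# assumes df1 and df2 from the question are in scope
v1 = df1.loc[:, "val1":"val2"].to_numpy()[:, None]
v2 = df2.loc[:, "val1":"val2"].to_numpy()

closest = np.abs(v1 - v2).sum(axis=-1).argmin(axis=-1)

result = pd.DataFrame({
    "id_x": df1["id"],
    "id_y": df2["id"].iloc[closest].to_list()
})
print(result)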