Efficiently compare ROW of dataframe with another ENTIRE dataframe. 1 row:N rows

I would like to create a new column in df that holds, for each row, a list of every value of ddf['A'] that matches (is "close to") that row's df['A'].

Example:

Input:

df
Out[19]: 
    A
0  10
1  20
2  30
3  40
ddf
Out[20]: 
    A
0   9
1  10
2  31
3  50
4  60

Output

df
Out[22]: 
    A        B
0  10  [9, 10]
1  20       []
2  30     [31]
3  40       []

The code below does not work: it fails if df and ddf are different sizes, and when they are the same size it only checks each row of df against the same-indexed row of ddf (i.e. 1 row : 1 row).

df['B'] = ddf[ddf['A'].between(df['A']*0.8, df['A']*1.2)]['A']

I want each row of df to be compared against the ENTIRE table of ddf for 'close' matches, i.e. 1 row : N rows.

This must be done efficiently.
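
For context, the attempt above behaves that way because pandas comparison operators (which between() uses internally) require identically-labelled Series, which is also why it raises for different sizes. A minimal sketch of the 1 row : 1 row behaviour, using the example values truncated to equal lengths so it runs at all:

import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40]})
ddf = pd.DataFrame({'A': [9, 10, 31, 50]})  # truncated to match df's length

# between() compares ddf['A'] row i against df['A'] row i only:
mask = ddf['A'].between(df['A'] * 0.8, df['A'] * 1.2)
print(mask.tolist())  # [True, False, True, False]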

CodePudding user response:

The fastest approach I can come up with is as follows:

import pandas as pd

data = {'A': [10, 20, 30, 40]}
df = pd.DataFrame(data)

ddata = {'A': [9, 10, 31, 50, 60]}
dff = pd.DataFrame(ddata)

def f(df1, df2):
    # For each value in df1, mask all of df2 at once (within +/-20%)
    # and yield the matching values as a plain list.
    for val in df1:
        b = (df2 >= val*.8) & (df2 <= val*1.2)
        yield df2[b].tolist()

df['B'] = list(f(df.A.to_numpy(), dff.A.to_numpy()))
print(df)

    A        B
0  10  [9, 10]
1  20       []
2  30     [31]
3  40       []

If I enlarge both dataframes by a factor of 5000 (e.g. data = {'A': [10, 20, 30, 40]*5000}), the calculation takes just under 2 seconds. The solution offered below using df.A.apply takes just over 100 seconds.
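
For comparison, the same idea can also be written as a single broadcast comparison in NumPy. This is a hedged sketch, not the timed code above; it builds a len(df) x len(dff) boolean matrix, so it trades memory for speed:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40]})
dff = pd.DataFrame({'A': [9, 10, 31, 50, 60]})

a = df['A'].to_numpy()
d = dff['A'].to_numpy()

# Entry (i, j) is True when dff value j lies within +/-20% of df value i.
mask = (d >= a[:, None] * 0.8) & (d <= a[:, None] * 1.2)

# One list of matching dff values per df row.
df['B'] = [d[row].tolist() for row in mask]
print(df)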

CodePudding user response:

import pandas as pd

df = pd.DataFrame([[10], [20], [30], [40]], columns=['A'])
dff = pd.DataFrame([[9], [10], [31], [50], [60]], columns=['A'])

# For each value x in df.A, collect every dff.A value within +/-20% of x.
df['B'] = df.A.apply(lambda x: [v for v in dff.A if 0.8 * x <= v <= 1.2 * x])
print(df)

prints

    A        B
0  10  [9, 10]
1  20       []
2  30     [31]
3  40       []
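
If dff grows much larger than df, a sorted binary search scales better than scanning all of dff for every row. A sketch under the same +/-20% rule from the question (np.searchsorted does the range lookups; everything else is illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40]})
dff = pd.DataFrame({'A': [9, 10, 31, 50, 60]})

d = np.sort(dff['A'].to_numpy())

# Binary-search the first index >= 0.8*x and the first index > 1.2*x;
# the slice between them holds every 'close' match for x.
lo = np.searchsorted(d, df['A'].to_numpy() * 0.8, side='left')
hi = np.searchsorted(d, df['A'].to_numpy() * 1.2, side='right')
df['B'] = [d[l:h].tolist() for l, h in zip(lo, hi)]
print(df)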