I would like to create a new column in df that holds, for each row of df, a list of every value of ddf['A'] that matches (is "close to") that row's df['A'] value.
Example:
Input:
df
Out[19]:
A
0 10
1 20
2 30
3 40
ddf
Out[20]:
A
0 9
1 10
2 31
3 50
4 60
Output:
df
Out[22]:
A B
0 10 [9, 10]
1 20 []
2 30 [31]
3 40 []
The code below does not work: it fails if df and ddf are different sizes, and when they are the same size it only checks one row of df against one row of ddf (i.e. 1 row : 1 row).
df['B'] = ddf[ddf['A'].between(df['A']*0.8, df['A']*1.2)]['A']
I want each row of df to be compared against the ENTIRE table of ddf for 'close' matches, i.e. 1 row : N rows.
This must be done efficiently.
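For concreteness, here is a minimal reproduction of the example frames and of the "close to" check for a single row, using the same 0.8/1.2 factors as the attempt above (the snippet is only an illustration of the intended 1 row : N rows behaviour):

import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40]})
ddf = pd.DataFrame({'A': [9, 10, 31, 50, 60]})

# For df row 0 (value 10) the "close" band is [8, 12],
# so the expected matches from ddf['A'] are 9 and 10.
val = df.loc[0, 'A']
matches = ddf.loc[ddf['A'].between(val * 0.8, val * 1.2), 'A'].tolist()
print(matches)  # [9, 10]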
CodePudding user response:
Fastest that I can come up with is as follows:
import pandas as pd
data = {'A': [10, 20, 30, 40]}
df = pd.DataFrame(data)
ddata = {'A': [9, 10, 31, 50, 60]}
dff = pd.DataFrame(ddata)
def f(df1, df2):
    # for each value in df1, yield the df2 values within +/-20% of it
    for val in df1:
        b = (df2 >= val * .8) & (df2 <= val * 1.2)
        yield df2[b]

df['B'] = list(f(df.A.to_numpy(), dff.A.to_numpy()))
print(df)
A B
0 10 [9, 10]
1 20 []
2 30 [31]
3 40 []
If I enlarge both dfs by a factor of 5000 (e.g. data = {'A': [10, 20, 30, 40]*5000}), the calculation takes just under 2 seconds. The solution offered using df.A.apply actually takes just over 100 seconds.
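For reference, the same pairwise comparison can also be expressed as a single broadcasted mask instead of a Python-level loop. This is only a sketch of that alternative, not benchmarked against the figures above:

import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40]})
dff = pd.DataFrame({'A': [9, 10, 31, 50, 60]})

a = df['A'].to_numpy()
d = dff['A'].to_numpy()

# mask[i, j] is True when dff value j lies within +/-20% of df value i
mask = (d >= a[:, None] * 0.8) & (d <= a[:, None] * 1.2)
df['B'] = [d[row].tolist() for row in mask]
print(df)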
CodePudding user response:
import pandas as pd
df = pd.DataFrame([[10], [20], [30], [40]], columns=['A'])
dff = pd.DataFrame([[9], [10], [31], [50], [60]], columns=['A'])
# for each value x in df.A, collect every dff.A value within +/-20% of x
df['B'] = df.A.apply(lambda x: [v for v in dff.A if 0.8 * x <= v <= 1.2 * x])
print(df)
prints
| index | A | B |
|---|---|---|
| 0 | 10 | [9, 10] |
| 1 | 20 | [] |
| 2 | 30 | [31] |
| 3 | 40 | [] |
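The same apply-based logic can also be written with Series.between, mirroring the syntax from the original attempt; this is just a stylistic variant of the line above:

df['B'] = df['A'].apply(lambda x: dff.loc[dff['A'].between(0.8 * x, 1.2 * x), 'A'].tolist())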