Find index of row from df for which |df$col1-df2$col1| = min|dist|. Then find max value of col2 from-CodePudding

I have two data frames df1 and df2.

df1 has two columns. I'd like to find the index idx of row from df1, that contains in col1 the minimum distance from the value of df2 col1.

Next, I want to find the max value of df1 col2 from neighborhood [idx-10, idx 10]. Then attach that value to df2 to corresponding row for which the min distance was found.

Here is an example of the two data frames:

 df1 = pd.DataFrame({"A": [1,1.5,2.5,3,3.5,3.95,4.55,5.1, 5.54, 6.02, 6.57], 
"B": [0.25, 0.6, 1.34, 0.75, 0.16, 1.04, 0.99, 0.23, 0.95, 0.65,0.22]})
       A     B
0   1.00  0.25
1   1.50  0.60
2   2.50  1.34
3   3.00  0.75
4   3.50  0.16
5   3.95  1.04
6   4.55  0.99
7   5.10  0.23
8   5.54  0.95
9   6.02  0.65
10  6.57  0.22



df2 = pd.DataFrame({"A1": [1.06, 2.34, 3.70, 6, 7.2], "C": ["pink", "blue", "orange", "pink", "orange"]})



      A1       C
0  1.06    pink
1  2.34    blue
2  3.70  orange
3  6.00    pink
4  7.20  orange

For the convenience sake, I'll choose the interval [idx-2;idx 2]. Here is the desired result:

  df3 = pd.DataFrame({"A1":[1.06,2.34, 3.70, 6, 7.2], "B": [1.34, 1.34, 1.04, 0.95, 0.95], "C": ["pink", "blue", "orange",
"pink", "orange"]})
   
     A1     B       C
0  1.06  1.34    pink
1  2.34  1.34    blue
2  3.70  1.04  orange
3  6.00  0.95    pink
4  7.20  0.95  orange

I know that I could merge these data frames, but the problem is that one of them, df1, is very large, so it would take ridiculous amount of memory to compute all possible distances of columns A and A1. So, maybe there is something more efficient than that?

CodePudding user response：

Assuming df1 is already sorted on column A, Lets use searchsorted to get the indices of closest values in A corresponding to the values from A1 , then make sure to clip the indices where the index value equal to length of df1. Now, do a centered rolling max on df1.B, finally use the closest indices to get the corresponding maximum values within a interval

i = np.searchsorted(df1['A'], df2['A1'], side='right')
i[i == len(df1)] -= 1

w = 2
r = df1['B'].rolling(2 * w   1, center=True, min_periods=1).max()
df2['B'] = r.iloc[i].tolist()

     A1       C     B
0  1.06    pink  1.34
1  2.34    blue  1.34
2  3.70  orange  1.04
3  6.00    pink  0.95
4  7.20  orange  0.95