Find the nearest value in dataframe column and calculate the difference for each row-CodePudding

Even though I was googling a lot, I couldn't find the solution for my problem.

I have dataframe

    filter10    REF
0   NaN         0.00
1   NaN         0.75
2   NaN         1.50
3   NaN         2.25
4   NaN         3.00
5   NaN         3.75
6   NaN         4.50
...
15  2.804688    11.25
16  3.021875    12.00
17  3.578125    12.75
18  3.779688    13.50
...
27  NaN         20.25
28  NaN         21.00
29  NaN         21.75
30  NaN         22.50
31  6.746875    NaN
32  NaN         NaN
...

I would like now to add the column df['DIFF'] where function goes through whole column filter10 and when it is the number it finds closest number in REF column.

And afterwards calculate the difference between them and put it the same row as number in filter10 is.

I would like this output:

    filter10    REF     DIFF
0   NaN         0.00    NaN
1   NaN         0.75    NaN
2   NaN         1.50    NaN
3   NaN         2.25    NaN
4   NaN         3.00    NaN
5   NaN         3.75    NaN
6   NaN         4.50    NaN
...
15  2.804688    11.25   -0.195312 # 2.804688 - 3 (find closest value in REF) = -0.195312
16  3.021875    12.00   0.021875
17  3.578125    12.75   -0.171875
18  3.779688    13.50   0.029688
...
27  NaN         20.25   NaN
28  NaN         21.00   NaN
29  NaN         21.75   NaN
30  NaN         22.50   NaN
31  6.746875    NaN     -0.003125
32  NaN         NaN     NaN
...

CodePudding user response：

Use pandas.merge_asof to find the nearest value:

df['DIFF'] = (pd.merge_asof(df['filter10'].dropna().sort_values().reset_index(),
                            df[['REF']].dropna().sort_values('REF'),
                            left_on='filter10', right_on='REF', direction='nearest')
                .set_index('index')['REF'].rsub(df['filter10'])
              )

Output:

    filter10    REF      DIFF
0        NaN   0.00       NaN
1        NaN   0.75       NaN
2        NaN   1.50       NaN
3        NaN   2.25       NaN
4        NaN   3.00       NaN
5        NaN   3.75       NaN
6        NaN   4.50       NaN
15  2.804688  11.25 -0.195312
16  3.021875  12.00  0.021875
17  3.578125  12.75 -0.171875
18  3.779688  13.50  0.029688
27       NaN  20.25       NaN
28       NaN  21.00       NaN
29       NaN  21.75       NaN
30       NaN  22.50       NaN
31  6.746875    NaN  2.246875 # likely different due to missing data
32       NaN    NaN       NaN

CodePudding user response：

As an alternative, one can use cKDTree for this:

from scipy.spatial import cKDTree

tree = cKDTree(df.REF.values[:,None])

df['DIFF'] = df.filter10 - np.array([df.REF[i] if not np.isinf(dist) else np.nan for dist,i in [tree.query(x,1) for x in df.filter10]]) 

#    filter10    REF      DIFF
#0        NaN   0.00       NaN
#1        NaN   0.75       NaN
#2        NaN   1.50       NaN
#3        NaN   2.25       NaN
#4        NaN   3.00       NaN
#5        NaN   3.75       NaN
#6        NaN   4.50       NaN
#15  2.804688  11.25 -0.195312
#16  3.021875  12.00  0.021875
#17  3.578125  12.75 -0.171875
#18  3.779688  13.50  0.029688
#27       NaN  20.25       NaN
#28       NaN  21.00       NaN
#29       NaN  21.75       NaN
#30       NaN  22.50       NaN
#31  6.746875    NaN  2.246875
#32       NaN    NaN       NaN

The query method returns infinity when the point in question is nan.