Smallest difference from every row in a dataframe-CodePudding

A = [1,3,7]
B = [6,4,8]
C = [2, 2, 8]
datetime = ['2022-01-01', '2022-01-02', '2022-01-03']
df1 = pd.DataFrame({'DATETIME':datetime,'A':A,'B':B, 'C':C })
df1.set_index('DATETIME', inplace = True)
df1
                 
A = [1,3,7,6, 8]
B = [3,8,10,5, 8]
C = [5, 7, 9, 6, 5]

datetime = ['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04', '2022-03-05']
df2 = pd.DataFrame({'DATETIME':datetime,'A':A,'B':B, 'C':C  })
df2.set_index('DATETIME', inplace = True)
df2

I want to compare the difference between every row of df1 to that of df2 and output that date for each row in df1. Lets take the first row in df1 (2022-01-01) where A=1, B=6, and C = 2. Comparing that to df2 2022-03-01 where A=1, B = 3, and C = 5, we get a total difference of 1-1=0, 6-3=3, and 2-5 = 3 for a total of 0 3 3= 6 total difference. Comparing that 2022-01-01 to the rest of df2 we see that 2022-03-01 is the lowest total difference and would like the date in df1.

CodePudding user response：

I'm assuming that you want the lowest total absolute difference.

The fastest way is probably to convert the DataFrames to numpy arrays, and use numpy broadcasting to efficiently perform the computations.

# for each row of df1 get the (positional) index of the df2 row corresponding to the lowest total absolute difference
min_idx = abs(df1.to_numpy()[:,None] - df2.to_numpy()).sum(axis=-1).argmin(axis=1)

df1['min_diff_date'] = df2.index[min_idx]

Output:

>>> df1

            A  B  C min_diff_date
DATETIME                         
2022-01-01  1  6  2    2022-03-01
2022-01-02  3  4  2    2022-03-01
2022-01-03  7  8  8    2022-03-03

Steps:

# Each 'block' corresponds to the absolute difference between a row of df1 and all the rows of df2 
>>> abs(df1.to_numpy()[:,None] - df2.to_numpy())

array([[[0, 3, 3],
        [2, 2, 5],
        [6, 4, 7],
        [5, 1, 4],
        [7, 2, 3]],

       [[2, 1, 3],
        [0, 4, 5],
        [4, 6, 7],
        [3, 1, 4],
        [5, 4, 3]],

       [[6, 5, 3],
        [4, 0, 1],
        [0, 2, 1],
        [1, 3, 2],
        [1, 0, 3]]])

# sum the absolute differences over the columns of each block
>>> abs(df1.to_numpy()[:,None] - df2.to_numpy()).sum(-1)

array([[ 6,  9, 17, 10, 12],
       [ 6,  9, 17,  8, 12],
       [14,  5,  3,  6,  4]])

# for each row of the previous array get the column index of the lowest value
>>> abs(df1.to_numpy()[:,None] - df2.to_numpy()).sum(-1).argmin(1)

array([0, 0, 2])