Home > Enterprise >  Find row with closest value
Find row with closest value

Time:06-27

I have a very large pandas dataframe with two columns, A and B. For each row containing values a and b in columns A and B respectively, I'd like to find another row with values a' and b' so that the absolute difference between a and a' is as small as possible. I would like to create two new columns: a column containing the "distance" between the two rows (i.e., abs(a - a')), and a column containing b'.


Here are a couple of exmaples. Let's say we have the following dataframe:

df = pd.DataFrame({'A' : [1, 5, 7, 2, 3, 4], 'B' : [5, 2, 7, 5, 1, 9]})

The first row has (a, b) = (1, 5). The two new columns for this row would contain the values 1 and 5. Why? Because the closest value to a = 1 is a' = 2, which occurs in the fourth row. The value of b' in that row is 5.

The second row has (a, b) = (5, 2). The two new columns for this row would contain the values 1 and 9. The closest value to a = 5 is a' = 4, which occurs in the last row. The corresponding value of b' in that row is 9.

If the value of a' that minimizes (a - a') isn't unique, ties can be broken arbitrarily (or you can keep all entries).


I believe I need to use the pandas.merge_asof function, which allows for approximate joining. I also think that I need to set merge_asof function's direction keyword argument to nearest, which will allow selecting the closest (in absolute distance) to the left dataframe's key.

I've read the entire documentation (with examples) for pandas.merge_asof, but forming the correct query is a little bit tricky for me.

CodePudding user response:

Use merge_asof with allow_exact_matches=False and direction='nearest' parameters, last for A1 subtract A column with absolute values:

df1 = df.sort_values('A')
df = pd.merge_asof(df1, 
                   df1.rename(columns={'A':'A1', 'B':'B1'}), 
                   left_on='A', 
                   right_on='A1', 
                   allow_exact_matches=False, 
                   direction='nearest')

df['A1'] = df['A1'].sub(df['A']).abs()
print (df)
   A  B  A1  B1
0  1  5   1   5
1  2  5   1   5
2  3  1   1   5
3  4  9   1   1
4  5  2   1   9
5  7  7   2   2
  • Related