I have a very large pandas dataframe with two columns, A and B. For each row containing values a and b in columns A and B respectively, I'd like to find another row with values a' and b' so that the absolute difference between a and a' is as small as possible. I would like to create two new columns: a column containing the "distance" between the two rows (i.e., abs(a - a')), and a column containing b'.
Here are a couple of exmaples. Let's say we have the following dataframe:
df = pd.DataFrame({'A' : [1, 5, 7, 2, 3, 4], 'B' : [5, 2, 7, 5, 1, 9]})
The first row has (a, b) = (1, 5)
. The two new columns for
this row would contain the values 1
and 5
. Why? Because the closest value to a = 1
is a' = 2
, which occurs in the fourth row. The value of b'
in that row is 5
.
The second row has (a, b) = (5, 2)
. The two new columns for this row would contain the values 1
and 9
. The closest value to a = 5
is a' = 4
, which occurs in the last row. The corresponding value of b'
in that row is 9
.
If the value of a'
that minimizes (a - a')
isn't unique, ties can be broken arbitrarily (or you can keep all entries).
I believe I need to use the pandas.merge_asof
function, which allows for approximate joining. I also think that I need to set merge_asof
function's direction
keyword argument to nearest
, which will allow selecting the closest (in absolute distance) to the left dataframe's key.
I've read the entire documentation (with examples) for pandas.merge_asof, but forming the correct query is a little bit tricky for me.
CodePudding user response:
Use merge_asof
with allow_exact_matches=False
and direction='nearest'
parameters, last for A1
subtract A
column with absolute values:
df1 = df.sort_values('A')
df = pd.merge_asof(df1,
df1.rename(columns={'A':'A1', 'B':'B1'}),
left_on='A',
right_on='A1',
allow_exact_matches=False,
direction='nearest')
df['A1'] = df['A1'].sub(df['A']).abs()
print (df)
A B A1 B1
0 1 5 1 5
1 2 5 1 5
2 3 1 1 5
3 4 9 1 1
4 5 2 1 9
5 7 7 2 2