Find value smaller but closest to current value-CodePudding

I have a very large pandas dataframe that contains two columns, column A and column B. For each value in column A, I would like to find the largest value in column B that is less than the corresponding value in column A. Note that each value in column B can be mapped to many values in column A.

Here's an example with a smaller dataset. Let's say I have the following dataframe:

df = pd.DataFrame({'a' : [1, 5, 7, 2, 3, 4], 'b' : [5, 2, 7, 5, 1, 9]})

I would like to find some third column -- say c -- so that

c = [nil, 2, 5, 1, 2, 2].

Note that each entry in c is strictly less than the corresponding value in c.

Upon researching, I think that I want something similar to pandas.merge_asof, but I can't quite get the query correct. In particular, I'm struggling because I only have one dataframe and not two. Perhaps I can form a second dataframe from my current one to get what I need, but I can't quite get it right. Any help is appreciated.

CodePudding user response：

Yes, it is doable using pandas.merge_asof. Explanation as comments in the code -

import pandas as pd
df = pd.DataFrame({'a' : [1, 5, 7, 2, 3, 4], 'b' : [5, 2, 7, 5, 1, 9]})

# merge_asof requires the keys to be sorted
adf = df[['a']].sort_values(by='a')
bdf = df[['b']].sort_values(by='b')

# your example wants 'strictly less' so we also add 'allow_exact_matches=False'
cdf_ordered = pd.merge_asof(adf, bdf, left_on='a', right_on='b', allow_exact_matches=False, direction='backward')

# rename the dataframe |a|b| -> |a|c|
cdf_ordered = cdf_ordered.rename(columns={'b': 'c'})

# since c is based on sorted a, we merge with original dataframe column a
new_df = pd.merge(df, cdf_ordered, on='a')
print(new_df)

"""
   a  b    c
0  1  5  NaN
1  5  2  2.0
2  7  7  5.0
3  2  5  1.0
4  3  1  2.0
5  4  9  2.0
"""