I have a very large pandas dataframe that contains two columns, column A and column B. For each value in column A, I would like to find the largest value in column B that is less than the corresponding value in column A. Note that each value in column B can be mapped to many values in column A.
Here's an example with a smaller dataset. Let's say I have the following dataframe:
df = pd.DataFrame({'a' : [1, 5, 7, 2, 3, 4], 'b' : [5, 2, 7, 5, 1, 9]})
I would like to find some third column -- say c -- so that
c = [nil, 2, 5, 1, 2, 2].
Note that each entry in c
is strictly less than the corresponding value in c
.
Upon researching, I think that I want something similar to pandas.merge_asof
, but I can't quite get the query correct. In particular, I'm struggling because I only have one dataframe and not two. Perhaps I can form a second dataframe from my current one to get what I need, but I can't quite get it right. Any help is appreciated.
CodePudding user response:
Yes, it is doable using pandas.merge_asof
. Explanation as comments in the code -
import pandas as pd
df = pd.DataFrame({'a' : [1, 5, 7, 2, 3, 4], 'b' : [5, 2, 7, 5, 1, 9]})
# merge_asof requires the keys to be sorted
adf = df[['a']].sort_values(by='a')
bdf = df[['b']].sort_values(by='b')
# your example wants 'strictly less' so we also add 'allow_exact_matches=False'
cdf_ordered = pd.merge_asof(adf, bdf, left_on='a', right_on='b', allow_exact_matches=False, direction='backward')
# rename the dataframe |a|b| -> |a|c|
cdf_ordered = cdf_ordered.rename(columns={'b': 'c'})
# since c is based on sorted a, we merge with original dataframe column a
new_df = pd.merge(df, cdf_ordered, on='a')
print(new_df)
"""
a b c
0 1 5 NaN
1 5 2 2.0
2 7 7 5.0
3 2 5 1.0
4 3 1 2.0
5 4 9 2.0
"""