Home > other >  Lookup based on two conditions and take the first value
Lookup based on two conditions and take the first value

Time:10-05

I would look like to look up value in df based on conditions df_col1 <= df2_col1 and df_col2 > df2_col1. If there are multiple values, take the first one. Any hint?

import pandas as pd

df = pd.DataFrame({'day1': [3, 7, 15], 'day2': [20, 15, 11], 'value': [10, 20, 30]})

df2 = pd.DataFrame({'day': [2, 8, 15, 22]})


# find value where day1 <= day and day2 > day:
# 2 is not greater/equal [3, 7, 15], hence 0
# 8 is greater/equal [3, 7] and less than [20, 15], hence [10]
# 15 is greater/equal [3, 7, 15] and less than [20], hence [10]
# 22 is greater/equal [15] but 22 is not less than [20], hence 0


df2['value'] = 1  # [0, 10, 10, 0]

I have the following but cannot move forward.

df2['day'].apply(lambda x: df.query(f"day1 <= {x} & day2 > {x}"))

I suspect the first match is selected with idxmax(). And zero is achieved via fillna(0).

CodePudding user response:

Perfect job for array broadcasting:

# Convert to numpy array, and raise the array by 1 dimension to prepare for the
# broadcasting
day = df2["day"].to_numpy()[:, None]
# Convert day1 and day2 to numpy arrays too
day1, day2 = df[["day1", "day2"]].to_numpy().T

# Compare every row in df1 against every row in df2
mask = (day1 <= day) & (day < day2)
# A match is found if any column on a row is True
match = np.any(mask, axis=1)
# Get the index of the first True column
idx = np.argmax(mask, axis=1)

# Convert the value column to numpy array
value = df["value"].to_numpy()
# When a match is found, return the corresponding value. If not, return 0
df2["value"] = np.where(match, value[idx], 0)

CodePudding user response:

One option is with conditional_join from pyjanitor - specifically the dev version:

# pip install git https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

(df2
.conditional_join(
    df, 
    ('day', 'day1', '>='), 
    ('day', 'day2', '<'), 
    how = 'left', 
    keep = 'first',
    df_columns = 'day',
    right_columns = 'value')
.fillna(0,downcast='infer')
)

   day  value
0    8     10
1   15     10
2    2      0
3   22      0

Under the hood it uses a modified form of binary search, to avoid a cartesian join

  • Related