I have two dataframes like this:
df1:
Value | Responsible |
---|---|
11000 | Jack |
21040 | Dylan |
12050 | Jack |
df2:
Start | End |
---|---|
10001 | 20000 |
20001 | 30000 |
Desired output:
Start | End | Responsible |
---|---|---|
10001 | 20000 | Jack |
20001 | 30000 | Dylan |
I need to join the 'Responsible' column from df1 onto df2, using 'Value' as the key, where 'Value' falls inside the range defined by 'Start' and 'End'. Assume each range maps to a single 'Responsible', although the same 'Responsible' can appear for several values in df1 (still only one per range).
Is there an easy way to do that?
CodePudding user response:
First find the smallest Value that is greater than or equal to Start, then make sure it is also smaller than End:
import pandas as pd

df1 = pd.DataFrame({'Value': [11000, 21040, 12050], 'Responsible': ['Jack', 'Dylan', 'Jack']})
df2 = pd.DataFrame({'Start': [10001, 20001], 'End': [20000, 30000]})

# For each Start, take the first Value >= Start (both frames must be sorted on the join keys)
df = pd.merge_asof(df2.sort_values('Start'), df1.sort_values('Value'),
                   left_on='Start', right_on='Value', direction='forward')
# Keep only matches that also fall below End, then drop the helper column
df = df[df['Value'] < df['End']].drop(columns='Value')
   Start    End Responsible
0  10001  20000        Jack
1  20001  30000       Dylan
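For small frames, the same range join can also be brute-forced as a sanity check: cross-merge the two frames and keep the pairs where Value lies between Start and End. This is a minimal sketch, assuming pandas >= 1.2 (for how='cross'); note that between is inclusive on both ends by default:
import pandas as pd

df1 = pd.DataFrame({'Value': [11000, 21040, 12050], 'Responsible': ['Jack', 'Dylan', 'Jack']})
df2 = pd.DataFrame({'Start': [10001, 20001], 'End': [20000, 30000]})

check = (
    df2.merge(df1, how='cross')                               # every range paired with every Value
    .loc[lambda d: d['Value'].between(d['Start'], d['End'])]  # keep pairs where Value falls in the range
    .drop_duplicates(subset=['Start', 'End'])                 # one Responsible per range, as the question states
    .drop(columns='Value')
    .reset_index(drop=True)
)
print(check)
#    Start    End Responsible
# 0  10001  20000        Jack
# 1  20001  30000       Dylan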
CodePudding user response:
You can generate bins from df2 and then bin your data in df1 based on those. Then you can sort by "Value" and drop_duplicates to select, for each bin, the row with the smallest value.
import pandas as pd  # df1 and df2 as defined in the question

bins = pd.IntervalIndex.from_arrays(df2["Start"], df2["End"])
out = (
    df1.assign(bins=pd.cut(df1["Value"], bins=bins))  # label each Value with the interval it falls into
    .sort_values("Value")
    .drop_duplicates(subset="bins")                   # keep only the smallest Value per bin
)
print(out)
   Value Responsible            bins
0  11000        Jack  (10001, 20000]
1  21040       Dylan  (20001, 30000]
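One caveat on the bins above (a hedged tweak, reusing the same df2): IntervalIndex.from_arrays defaults to closed='right', so a Value exactly equal to a Start (e.g. 10001) would come back from pd.cut as NaN. If the ranges are meant to include both endpoints, build the bins closed on both sides and proceed exactly as above:
# same pipeline as above, only the intervals now include both endpoints
bins = pd.IntervalIndex.from_arrays(df2["Start"], df2["End"], closed="both")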
If you would like to do some post-processing to get the data formatted exactly like your desired output, you can do:
final = (
    out.assign(
        # .cat.categories.left/.right give one value per category; this lines up
        # with the rows only because out has exactly one row per bin, sorted in
        # the same order as the categories
        start=lambda d: d["bins"].cat.categories.left,
        end=lambda d: d["bins"].cat.categories.right,
    )
    .drop(columns=["Value", "bins"])
)
print(final)
  Responsible  start    end
0        Jack  10001  20000
1       Dylan  20001  30000
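A more defensive variant of that post-processing (a sketch, reusing out from above): since .cat.categories.left/.right rely on the categories lining up positionally with the rows, you can instead read the endpoints from each row's own interval, which also makes it easy to match the column order of the desired output:
final = (
    out.assign(
        Start=lambda d: d["bins"].map(lambda iv: iv.left),   # left endpoint of this row's interval
        End=lambda d: d["bins"].map(lambda iv: iv.right),    # right endpoint of this row's interval
    )
    .drop(columns=["Value", "bins"])
    [["Start", "End", "Responsible"]]
)
print(final)
#    Start    End Responsible
# 0  10001  20000        Jack
# 1  20001  30000       Dylan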