I have two dataframes like this:
df1:
Value | Responsible |
---|---|
11000 | Jack |
21040 | Dylan |
12050 | Jack |
df2:
Start | End |
---|---|
10001 | 20000 |
20001 | 30000 |
Desired output:
Start | End | Responsible |
---|---|---|
10001 | 20000 | Jack |
20001 | 30000 | Dylan |
I need to join the 'Responsible' column from df1 onto df2, using 'Value' as the key, where 'Value' falls inside the range defined by 'Start' and 'End'. Assume each range maps to a single 'Responsible', although the same 'Responsible' can appear for several values in df1 (still only one per range).
Is there an easy way to do that?
CodePudding user response:
First find the smallest Value that is greater than or equal to Start, then make sure it is also smaller than End:
import pandas as pd

df1 = pd.DataFrame({'Value': [11000, 21040, 12050], 'Responsible': ['Jack', 'Dylan', 'Jack']})
df2 = pd.DataFrame({'Start': [10001, 20001], 'End': [20000, 30000]})

# For each Start, take the first Value >= Start (both frames must be sorted on the join keys)
df = pd.merge_asof(df2.sort_values('Start'), df1.sort_values('Value'),
                   left_on='Start', right_on='Value', direction='forward')
# Keep only matches that also fall below End, then drop the helper column
df = df[df['Value'] < df['End']].drop(columns='Value')
   Start    End Responsible
0  10001  20000        Jack
1  20001  30000       Dylan
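For small frames, the same range join can also be brute-forced as a sanity check: cross-merge the two frames and keep the pairs where Value lies between Start and End. This is a minimal sketch, assuming pandas >= 1.2 (for how='cross'); note that between is inclusive on both ends by default:
import pandas as pd

df1 = pd.DataFrame({'Value': [11000, 21040, 12050], 'Responsible': ['Jack', 'Dylan', 'Jack']})
df2 = pd.DataFrame({'Start': [10001, 20001], 'End': [20000, 30000]})

check = (
    df2.merge(df1, how='cross')                               # every range paired with every Value
    .loc[lambda d: d['Value'].between(d['Start'], d['End'])]  # keep pairs where Value falls in the range
    .drop_duplicates(subset=['Start', 'End'])                 # one Responsible per range, as the question states
    .drop(columns='Value')
    .reset_index(drop=True)
)
print(check)
#    Start    End Responsible
# 0  10001  20000        Jack
# 1  20001  30000       Dylan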
CodePudding user response:
You can generate bins from df2 and then bin your data in df1 based on those. Then you can sort by "Value" and drop_duplicates to select, for each bin, the row with the smallest value.
import pandas as pd  # df1 and df2 as defined in the question

bins = pd.IntervalIndex.from_arrays(df2["Start"], df2["End"])
out = (
    df1.assign(bins=pd.cut(df1["Value"], bins=bins))  # label each Value with the interval it falls into
    .sort_values("Value")
    .drop_duplicates(subset="bins")                   # keep only the smallest Value per bin
)
print(out)
   Value Responsible            bins
0  11000        Jack  (10001, 20000]
1  21040       Dylan  (20001, 30000]
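One caveat on the bins above (a hedged tweak, reusing the same df2): IntervalIndex.from_arrays defaults to closed='right', so a Value exactly equal to a Start (e.g. 10001) would come back from pd.cut as NaN. If the ranges are meant to include both endpoints, build the bins closed on both sides and proceed exactly as above:
# same pipeline as above, only the intervals now include both endpoints
bins = pd.IntervalIndex.from_arrays(df2["Start"], df2["End"], closed="both")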
If you would like to do some post-processing to get the data formatted exactly like your desired output, you can do:
final = (
    out.assign(
        # .cat.categories.left/.right give one value per category; this lines up
        # with the rows only because out has exactly one row per bin, sorted in
        # the same order as the categories
        start=lambda d: d["bins"].cat.categories.left,
        end=lambda d: d["bins"].cat.categories.right,
    )
    .drop(columns=["Value", "bins"])
)
print(final)
  Responsible  start    end
0        Jack  10001  20000
1       Dylan  20001  30000
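A more defensive variant of that post-processing (a sketch, reusing out from above): since .cat.categories.left/.right rely on the categories lining up positionally with the rows, you can instead read the endpoints from each row's own interval, which also makes it easy to match the column order of the desired output:
final = (
    out.assign(
        Start=lambda d: d["bins"].map(lambda iv: iv.left),   # left endpoint of this row's interval
        End=lambda d: d["bins"].map(lambda iv: iv.right),    # right endpoint of this row's interval
    )
    .drop(columns=["Value", "bins"])
    [["Start", "End", "Responsible"]]
)
print(final)
#    Start    End Responsible
# 0  10001  20000        Jack
# 1  20001  30000       Dylan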