Home > Enterprise >  Join two dataframes by range and values
Join two dataframes by range and values

Time:03-10

I have two dataframes like this:

df1:

Value Responsible
11000 Jack
21040 Dylan
12050 Jack

df2:

Start End
10001 20000
20001 30000

Desired output:

Start End Responsible
10001 20000 Jack
20001 30000 Dylan

I need to join the 'Responsible' column in df2, using 'Value' as key that is in the range set by 'Start' and 'End'. Let's define that 'Value' has unique 'Responsible' for each range, but there are repeated values of 'Responsibles' in df1 (again, just one for each range).

Is there an easy way to do that?

CodePudding user response:

First find the smallest Value that is larger than Start, then make sure it is smaller than End:

import pandas as pd
df1 = pd.DataFrame({'Value':[11000,21040,12050], 'Responsible':['Jack', 'Dylan', 'Jack']})
df2 = pd.DataFrame({'Start':[10001,20001], 'End':[20000, 30000]})

df = pd.merge_asof(df2.sort_values('Start'), df1.sort_values('Value'),
                   left_on = 'Start', right_on = 'Value', direction='forward')
df = df[df['Value']<df['End']].drop(columns = 'Value')
    Start   End     Responsible
0   10001   20000   Jack
1   20001   30000   Dylan

CodePudding user response:

You can generate bins from df2 and then bin your data in df1 based on those.

Then you can sort according to "Value" and drop_duplicates to select each row who has the minimum value from each bin.

bins = pd.IntervalIndex.from_arrays(df2["Start"], df2["End"])
out = (
    df1.assign(bins=pd.cut(df1["Value"],bins=bins))
    .sort_values("Value")
    .drop_duplicates(subset="bins")
)

print(out)
   Value Responsible            bins
0  11000        Jack  (10001, 20000]
1  21040       Dylan  (20001, 30000]

If you would like to perform some post-processing to get your data formatted exactly as your output is you can do:

final = (
    out.assign(
        start=lambda d: d["bins"].cat.categories.left,
        end=lambda d: d["bins"].cat.categories.right
    )
    .drop(columns=["Value", "bins"])
)

print(final)
  Responsible  start    end
0        Jack  10001  20000
1       Dylan  20001  30000

  • Related