Map range of columns based on another range in pandas-CodePudding

I have two datasets (df1 and df2) of values with a certain range (Start and End) in both of them.

I would like to annotate the first one (df1) with the range of values (Start/End) on df2.

Example: The first row in df1 ranges from 0-2300000, this since 2300000 is lower than the End in first row in df2 would be annotated with Num 3, and also the row 2 of df1.

However, in the case of the last row of df1, the range contains two rows from df2, so there needs to output the sum in Num in the two rows of df2.

The desired output would be df3

df1.head()

|Start    |End      |Tag    |
|---------|---------|-------|
|0        |2300000  |gneg45 |   
|2300000  |5400000  |gpos25 |
|143541857|200000000|gneg34 |

df2.head()

| Start   |   End   |  Num   |
|---------|---------|--------|
|62920    |121705338|  3     |   
|143541857|147901334|  2     |
|147901760|151020217|  5     |


df3 = 

|Start    |End      |Num    |
|---------|---------|-------|
|0        |2300000  |3      |   
|2300000  |5400000  |3      |
|143541857|200000000|7      |

I tried panda merge creating a key and query based on a range of columns, but nothing really worked.

Thanks in advance!!

CodePudding user response：

From your description, you are looking for overlapping range in df1 and df2 in order for df1 to take the Num from df2.

To formulate the condition of overlapping range condition, let's illustrate as follows the condition of non-overlapping range:

Either:

                                         |<-------------->| 
                                      df2.Start       .df2.End
           |<------------->| 
        df1.Start       df1.End

or:

                 |<-------------->| 
              df2.Start       .df2.End
                                             |<------------->| 
                                          df1.Start       df1.End

This non-overlapping range condition can be formulated as:

Either (df1.End < df2.Start) or (df1.Start > df2.End)

Therefore, the overlapping range condition is the negation of the above conditions, that is:

~ ((df1.End < df2.Start) | (df1.Start > df2.End))

which is equivalent to:

(df1.End >= df2.Start) & (df1.Start <= df2.End)

Solution 1: Simple Solution for small dataset

Step 1: For small dataset, you can cross join df1 and df2 by .merge(), then filter by the overlapping condition using .query(), as follows:

df3 = (df1.merge(df2, how='cross', suffixes=('_df1', '_df2'))
          .query('(End_df1 >= Start_df2) & (Start_df1 <= End_df2)')
          .rename({'Start_df1': 'Start', 'End_df1': 'End'}, axis=1)
          [['Start', 'End', 'Num']]
      )

If your Pandas version is older than 1.2.0 (released in December 2020) and does not support merge with how='cross', you can use:

df3 = (df1.assign(key=1).merge(df2.assign(key=1), on='key', suffixes=('_df1', '_df2')).drop('key', axis=1)
          .query('(End_df1 >= Start_df2) & (Start_df1 <= End_df2)')
          .rename({'Start_df1': 'Start', 'End_df1': 'End'}, axis=1)
          [['Start', 'End', 'Num']]
      )

Intermediate result:

print(df3)

       Start        End  Num
0          0    2300000    3
3    2300000    5400000    3
7  143541857  200000000    2
8  143541857  200000000    5

Step 2: Sum up the Num values for same range (same Start and End) by .groupby() and .sum():

df3 = df3.groupby(['Start', 'End'])['Num'].sum().reset_index()

Result:

print(df3)

       Start        End  Num
0          0    2300000    3
1    2300000    5400000    3
2  143541857  200000000    7

Solution 2: Numpy Solution for large dataset

For large dataset and performance is a concern, you can use numpy broadcasting (instead of cross join and filtering) to speed up the execution time:

Step 1:

d1_S = df1.Start.to_numpy()
d1_E = df1.End.to_numpy()
d2_S = df2.Start.to_numpy()
d2_E = df2.End.to_numpy()

# filter for overlapping range condition and get the respective row indexes of `df1`, `df2` in `i` and `j`
i, j = np.where((d1_E[:, None] >= d2_S) & (d1_S[:, None] <= d2_E))

df3 = pd.DataFrame(
          np.column_stack([df1.values[i], df2.values[j]]),
          columns=df1.columns.append(df2.columns   '_df2')
      )

Intermediate result:

print(df3)

       Start        End     Tag  Start_df2    End_df2 Num_df2
0          0    2300000  gneg45      62920  121705338       3
1    2300000    5400000  gpos25      62920  121705338       3
2  143541857  200000000  gneg34  143541857  147901334       2
3  143541857  200000000  gneg34  147901760  151020217       5

Step 2: Sum up the Num values for same range (same Start and End) by .groupby() and .sum():

df3 = df3.groupby(['Start', 'End'])['Num_df2'].sum().reset_index(name='Num')

Result:

print(df3)

       Start        End  Num
0          0    2300000    3
1    2300000    5400000    3
2  143541857  200000000    7