I have 2 dataframes, df1 and df2.
e.g.,
df1:
id start end
1 0 15
1 15 30
1 30 45
2 0 15
2 15 30
2 30 45
df2:
id start end
1 0 1.1
1 1.1 11.4
1 11.4 34
1 34 46
2 0 1.5
2 1.5 20
2 20 30
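For reproducibility, the two frames above can be constructed like this (a pandas sketch of the tables as shown):

```python
import pandas as pd

# df1: fixed 15-unit windows per id
df1 = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2, 2],
    'start': [0, 15, 30, 0, 15, 30],
    'end':   [15, 30, 45, 15, 30, 45],
})

# df2: irregular intervals per id
df2 = pd.DataFrame({
    'id':    [1, 1, 1, 1, 2, 2, 2],
    'start': [0, 1.1, 11.4, 34, 0, 1.5, 20],
    'end':   [1.1, 11.4, 34, 46, 1.5, 20, 30],
})
```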
For each row in df1, I want to find the rows in df2 such that df1.id == df2.id and the two intervals [df1.start, df1.end] and [df2.start, df2.end] overlap. Overlap is defined by the condition: df1.start <= df2.end AND df1.end > df2.start.
So for the above, the result should be:
[0, 1, 2]
[2]
[2, 3]
[4, 5]
[5, 6]
[6]
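For reference, a naive (non-vectorized) baseline that applies the stated condition row by row and produces exactly these lists:

```python
import pandas as pd

df1 = pd.DataFrame({'id':    [1, 1, 1, 2, 2, 2],
                    'start': [0, 15, 30, 0, 15, 30],
                    'end':   [15, 30, 45, 15, 30, 45]})
df2 = pd.DataFrame({'id':    [1, 1, 1, 1, 2, 2, 2],
                    'start': [0, 1.1, 11.4, 34, 0, 1.5, 20],
                    'end':   [1.1, 11.4, 34, 46, 1.5, 20, 30]})

# one boolean mask over df2 per df1 row: same id and overlapping interval
result = [
    df2.index[(df2['id'] == row.id)
              & (row.start <= df2['end'])
              & (row.end > df2['start'])].tolist()
    for row in df1.itertuples()
]
# result == [[0, 1, 2], [2], [2, 3], [4, 5], [5, 6], [6]]
```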
Is there a vectorized way to do this?
The output should be a dataframe (or some other structure) of length len(df1), where each row holds the indices in df2 that satisfy the constraints above. df2 is quite large, on the order of millions of rows, and df1 is on the order of tens of thousands, so there may be memory concerns. Technically, I only need one row at a time to process before moving on to the next, so I don't necessarily need to store all the rows, but processing one row at a time may defeat the vectorization.
Ultimately, I want to use the output indices to compute weighted averages. df2 has a 4th column, quantity, and I want to compute its weighted average, where the weights are proportional to the amount of overlap. So technically I don't need to store the overlap indices, but I think the memory problem would still persist, since they would still be stored implicitly?
CodePudding user response:
Cross merge on id, query the rows of interest, then groupby:
(df1.reset_index().merge(df2.reset_index(), on='id', suffixes=['', '_x'])
   .query('start <= end_x & end > start_x')
   .groupby('index')['index_x'].agg(list)
)
Output:
index
0    [0, 1, 2]
1          [2]
2       [2, 3]
3       [4, 5]
4       [5, 6]
5          [6]
Name: index_x, dtype: object
Note: at your data size the cross merge could be a memory problem, since every df1 row is paired with every df2 row sharing the same id before filtering. In that case, processing each id in df1 separately keeps the intermediate frame much smaller, and you could also parallelize across ids:
data = []
# process each id separately to keep the merged frame small
for ID, d in df1.groupby('id'):
    overlaps = (d.reset_index()
                 .merge(df2.query('id == @ID').reset_index(),
                        on='id', suffixes=['', '_x'])
                 .query('start <= end_x & end > start_x')
                 .groupby('index')['index_x'].agg(list)
                )
    data.append(overlaps)

pd.concat(data)
Output:
index
0    [0, 1, 2]
1          [2]
2       [2, 3]
3       [4, 5]
4       [5, 6]
5          [6]
Name: index_x, dtype: object
Update per comment:
(df1.reset_index().merge(df2, on='id', suffixes=['','_x'])
.assign(overlap=lambda x: x[['end','end_x']].min(axis=1) - x[['start','start_x']].max(axis=1))
.query('overlap > 0')
.groupby('index')['overlap'].mean()
)
Output:
index
0 5.0
1 15.0
2 7.5
3 7.5
4 7.5
Name: overlap, dtype: float64
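If the end goal is the overlap-weighted average of quantity itself (rather than the mean overlap), the same merge can be extended: weight each quantity by its overlap length, then divide the per-group sums. The quantity values below are made up for illustration, since the question doesn't list them:

```python
import pandas as pd

df1 = pd.DataFrame({'id':    [1, 1, 1, 2, 2, 2],
                    'start': [0, 15, 30, 0, 15, 30],
                    'end':   [15, 30, 45, 15, 30, 45]})
# quantity values are hypothetical -- substitute your real column
df2 = pd.DataFrame({'id':       [1, 1, 1, 1, 2, 2, 2],
                    'start':    [0, 1.1, 11.4, 34, 0, 1.5, 20],
                    'end':      [1.1, 11.4, 34, 46, 1.5, 20, 30],
                    'quantity': [10.0, 20.0, 30.0, 40.0, 5.0, 15.0, 25.0]})

# merge, compute overlap length, and keep only true overlaps
tmp = (df1.reset_index().merge(df2, on='id', suffixes=['', '_x'])
       .assign(overlap=lambda x: x[['end', 'end_x']].min(axis=1)
                                 - x[['start', 'start_x']].max(axis=1))
       .query('overlap > 0'))

# weighted average = sum(quantity * overlap) / sum(overlap), per df1 row
wavg = (tmp['quantity'].mul(tmp['overlap']).groupby(tmp['index']).sum()
        / tmp['overlap'].groupby(tmp['index']).sum())
```

df1 rows with no positive overlap (here row 5, whose only candidate touches at a single point) drop out of the result; reindex against df1.index if you need them back as NaN.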