I have 2 dataframes, df1 and df2.
e.g.,
df1:
id start end
1 0 15
1 15 30
1 30 45
2 0 15
2 15 30
2 30 45
df2:
id start end
1 0 1.1
1 1.1 11.4
1 11.4 34
1 34 46
2 0 1.5
2 1.5 20
2 20 30
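For reproducibility, the two frames above can be constructed like this (a pandas sketch of the tables as shown):

```python
import pandas as pd

# df1: fixed 15-unit windows per id
df1 = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2, 2],
    'start': [0, 15, 30, 0, 15, 30],
    'end':   [15, 30, 45, 15, 30, 45],
})

# df2: irregular intervals per id
df2 = pd.DataFrame({
    'id':    [1, 1, 1, 1, 2, 2, 2],
    'start': [0, 1.1, 11.4, 34, 0, 1.5, 20],
    'end':   [1.1, 11.4, 34, 46, 1.5, 20, 30],
})
```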
For each row in df1, I want to find the rows in df2 such that df1.id == df2.id and the two intervals [df1.start, df1.end] and [df2.start, df2.end] overlap. Overlap is defined by the condition: df1.start <= df2.end AND df1.end > df2.start.
So for the above, the result should be:
[0, 1, 2]
[2]
[2, 3]
[4, 5]
[5, 6]
[6]
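For reference, a naive (non-vectorized) baseline that applies the stated condition row by row and produces exactly these lists:

```python
import pandas as pd

df1 = pd.DataFrame({'id':    [1, 1, 1, 2, 2, 2],
                    'start': [0, 15, 30, 0, 15, 30],
                    'end':   [15, 30, 45, 15, 30, 45]})
df2 = pd.DataFrame({'id':    [1, 1, 1, 1, 2, 2, 2],
                    'start': [0, 1.1, 11.4, 34, 0, 1.5, 20],
                    'end':   [1.1, 11.4, 34, 46, 1.5, 20, 30]})

# one boolean mask over df2 per df1 row: same id and overlapping interval
result = [
    df2.index[(df2['id'] == row.id)
              & (row.start <= df2['end'])
              & (row.end > df2['start'])].tolist()
    for row in df1.itertuples()
]
# result == [[0, 1, 2], [2], [2, 3], [4, 5], [5, 6], [6]]
```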
Is there a vectorized way to do this?
The output should be a dataframe (or some other structure) of length len(df1), where each row holds the indices in df2 that satisfy the constraints above. df2 is quite large, on the order of millions of rows, and df1 is on the order of tens of thousands, so there may be memory concerns. Technically, I only need one row at a time to process before moving on to the next, so I don't necessarily need to store all the rows, but processing one row at a time may defeat the vectorization.
Ultimately, I want to use the output indices to compute weighted averages. df2 has a 4th column, quantity, and I want to compute its weighted average, where the weights are proportional to the amount of overlap. So technically I don't need to store the overlap indices, but I think the memory problem would still persist, since they would still be stored implicitly?
CodePudding user response:
Cross merge on id, query the rows of interest, then groupby:
(df1.reset_index().merge(df2.reset_index(), on='id', suffixes=['', '_x'])
   .query('start <= end_x & end > start_x')
   .groupby('index')['index_x'].agg(list)
)
Output:
index
0    [0, 1, 2]
1          [2]
2       [2, 3]
3       [4, 5]
4       [5, 6]
5          [6]
Name: index_x, dtype: object
Note: at your data size the cross merge could be a memory problem, since every df1 row is paired with every df2 row sharing the same id before filtering. In that case, processing each id in df1 separately keeps the intermediate frame much smaller, and you could also parallelize across ids:
data = []
# process each id separately to keep the merged frame small
for ID, d in df1.groupby('id'):
    overlaps = (d.reset_index()
                 .merge(df2.query('id == @ID').reset_index(),
                        on='id', suffixes=['', '_x'])
                 .query('start <= end_x & end > start_x')
                 .groupby('index')['index_x'].agg(list)
                )
    data.append(overlaps)

pd.concat(data)
Output:
index
0    [0, 1, 2]
1          [2]
2       [2, 3]
3       [4, 5]
4       [5, 6]
5          [6]
Name: index_x, dtype: object
Update per comment:
(df1.reset_index().merge(df2, on='id', suffixes=['','_x'])
.assign(overlap=lambda x: x[['end','end_x']].min(axis=1) - x[['start','start_x']].max(axis=1))
.query('overlap > 0')
.groupby('index')['overlap'].mean()
)
Output:
index
0 5.0
1 15.0
2 7.5
3 7.5
4 7.5
Name: overlap, dtype: float64
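If the end goal is the overlap-weighted average of quantity itself (rather than the mean overlap), the same merge can be extended: weight each quantity by its overlap length, then divide the per-group sums. The quantity values below are made up for illustration, since the question doesn't list them:

```python
import pandas as pd

df1 = pd.DataFrame({'id':    [1, 1, 1, 2, 2, 2],
                    'start': [0, 15, 30, 0, 15, 30],
                    'end':   [15, 30, 45, 15, 30, 45]})
# quantity values are hypothetical -- substitute your real column
df2 = pd.DataFrame({'id':       [1, 1, 1, 1, 2, 2, 2],
                    'start':    [0, 1.1, 11.4, 34, 0, 1.5, 20],
                    'end':      [1.1, 11.4, 34, 46, 1.5, 20, 30],
                    'quantity': [10.0, 20.0, 30.0, 40.0, 5.0, 15.0, 25.0]})

# merge, compute overlap length, and keep only true overlaps
tmp = (df1.reset_index().merge(df2, on='id', suffixes=['', '_x'])
       .assign(overlap=lambda x: x[['end', 'end_x']].min(axis=1)
                                 - x[['start', 'start_x']].max(axis=1))
       .query('overlap > 0'))

# weighted average = sum(quantity * overlap) / sum(overlap), per df1 row
wavg = (tmp['quantity'].mul(tmp['overlap']).groupby(tmp['index']).sum()
        / tmp['overlap'].groupby(tmp['index']).sum())
```

df1 rows with no positive overlap (here row 5, whose only candidate touches at a single point) drop out of the result; reindex against df1.index if you need them back as NaN.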