I have a dataframe where two columns represent the start and end points of intervals on a real number line. I want to generate a third column as a list of the indices of rows which said row has any overlap with. I'm having difficulty creating a inequality boolean matrix for this natively in pandas. I assume logic like this s1<=e2 and e1>=s2
will do the trick, but I don't know how to effectively broadcast it.
As a toy example I'm hoping for a simple way to at least generate a 5x5 boolean matrix (with all True down the diagonal) given this dataframe:
import pandas as pd
intervals_df = pd.DataFrame({"Starts":[0,1,5,10,15,20],"Ends":[4,2,9,14,19,24]})
Starts Ends
0 0 4
1 1 2
2 5 9
3 10 14
4 15 19
5 20 24
CodePudding user response:
The condition for the two intervals (s1,e1)
and (s2,e2)
to intersect is max(s1,s2) <= min(e1,e2)
. So you can do a cross merge (this is the broadcast), calculate the condition, the pivot:
d = (intervals_df.reset_index()
.merge(intervals_df.reset_index(), how='cross')
.assign(cond=lambda x: x.filter(like='Starts').max(axis=1) <= x.filter(like='Ends').min(axis=1))
.pivot('index_x', 'index_y', 'cond')
)
You would get:
index_y 0 1 2 3 4 5
index_x
0 True True False False False False
1 True True False False False False
2 False False True False False False
3 False False False True False False
4 False False False False True False
5 False False False False False True
Or you can make do with numpy's broadcasting:
starts = intervals_df[['Starts']].to_numpy()
ends = intervals_df[['Ends']].to_numpy()
np.maximum(starts, starts.T) <= np.minimum(ends, ends.T)
Output:
array([[ True, True, False, False, False, False],
[ True, True, False, False, False, False],
[False, False, True, False, False, False],
[False, False, False, True, False, False],
[False, False, False, False, True, False],
[False, False, False, False, False, True]])