I have a dataframe with two columns that are datetime objects (time_a
and time_b
). I need to check on a row-by-row basis if the elements of time_a
or time_b
for such row, are contained within any of the other intervals defined by the other time_a
and time_b
rows. That's what I defined as 'overlap', if any period of work between time_a
or time_b
clashes partially with other intervals regardless of the room.
The way I managed to approach this initially was to create tuples with the data of time_a
and time_b
, and then checking on a row-by-row basis if time_a
or time_b
fell within any the range of any of these tuples.
That approach seemed convoluted, so I wanted to explore the power of Pandas for such purpose. Using this great question as example, I tried adapting it to my problem, using a dataframe named test_2
(columns are date, room, time_a, time_b, personnel_number
) whilst test_3
only has time_a
, time_b
columns. I wrote my partial solution like this:
any_in_range = lambda row, iterable: any(
[(x > row[2]) & (x < row[3]) for x in iterable])
test_2['label_1'] = test_2.apply(any_in_range, iterable=test_3['time_case_finished'], axis=1)
test_2['label_2'] = test_2.apply(any_in_range, iterable=test_3['time_finished_cleaning'], axis=1)
test_2['isOverlap'] = np.where((test_2['label_1'] == True) | (test_2['label_2'] == True), 1, 0)
final_overlap = test_2[test_2['isOverlap'] == 1]
And a sample of the outcome, is described below:
date room time_a time_b personnel_number label_1 label_2 isOverlap
77 2021-09-14 3 2021-09-14 12:01:42-07:00 2021-09-14 12:12:20-07:00 1 False False 0
80 2021-09-14 1 2021-09-14 13:15:36-07:00 2021-09-14 13:24:50-07:00 1 False False 0
83 2021-09-14 1 2021-09-14 14:21:52-07:00 2021-09-14 14:39:37-07:00 1 True False 1
84 2021-09-14 3 2021-09-14 14:38:58-07:00 2021-09-14 14:52:24-07:00 1 True True 1
90 2021-09-15 4 2021-09-15 09:25:11-07:00 2021-09-15 09:53:33-07:00 1 True True 1
91 2021-09-15 5 2021-09-15 09:28:30-07:00 2021-09-15 09:42:25-07:00 1 False False 0
92 2021-09-15 1 2021-09-15 09:52:18-07:00 2021-09-15 10:07:25-07:00 1 True True 1
93 2021-09-15 3 2021-09-15 10:02:05-07:00 2021-09-15 10:20:13-07:00 1 False True 1
Now, notice how row 90 is marked as 1, but my code fails to find the other row in which it is supposed to be overlapping (which should be row 91, marking a 0). The overlap is not total, even if just a minute, I still want to count it in as overlap, but my code is not fulfilling the purpose for every case in my dataset.
Any help or advice is dearly appreciated.
CodePudding user response:
The problem seems to boil down to finding overlapping intervals, where the intervals are defined by time_a
and time_b
This can be efficiently solved with the piso
(pandas interval set operations) package, in particular the adjacency_matrix method
import pandas as pd
import piso
ii = pd.IntervalIndex.from_arrays(df["time_a"], df["time_b"])
df["isOverlap"] = piso.adjacency_matrix(ii).any(axis=1).astype(int).values