Checking overlaps between two columns of datetime type in Pandas DataFrame

Time:04-25

I have a dataframe with two datetime columns (time_a and time_b). For each row, I need to check whether time_a or time_b falls within the interval defined by time_a and time_b of any other row. That is what I define as an 'overlap': a period of work between time_a and time_b that clashes, even partially, with another interval, regardless of the room.

My initial approach was to create tuples from time_a and time_b and then check, row by row, whether time_a or time_b fell within the range of any of these tuples.

That approach seemed convoluted, so I wanted to explore what Pandas can do here. Using this great question as an example, I tried adapting it to my problem, using a dataframe named test_2 (columns are date, room, time_a, time_b, personnel_number), whilst test_3 only has the time_a and time_b columns. I wrote my partial solution like this:

any_in_range = lambda row, iterable: any(
    [(x > row[2]) & (x < row[3]) for x in iterable])
test_2['label_1'] = test_2.apply(any_in_range, iterable=test_3['time_case_finished'], axis=1)
test_2['label_2'] = test_2.apply(any_in_range, iterable=test_3['time_finished_cleaning'], axis=1)
test_2['isOverlap'] = np.where((test_2['label_1'] == True) | (test_2['label_2'] == True), 1, 0)
final_overlap = test_2[test_2['isOverlap'] == 1]

A sample of the outcome is shown below:

    date        room  time_a                     time_b                     personnel_number  label_1  label_2  isOverlap
77  2021-09-14  3     2021-09-14 12:01:42-07:00  2021-09-14 12:12:20-07:00  1                 False    False    0
80  2021-09-14  1     2021-09-14 13:15:36-07:00  2021-09-14 13:24:50-07:00  1                 False    False    0
83  2021-09-14  1     2021-09-14 14:21:52-07:00  2021-09-14 14:39:37-07:00  1                 True     False    1
84  2021-09-14  3     2021-09-14 14:38:58-07:00  2021-09-14 14:52:24-07:00  1                 True     True     1
90  2021-09-15  4     2021-09-15 09:25:11-07:00  2021-09-15 09:53:33-07:00  1                 True     True     1
91  2021-09-15  5     2021-09-15 09:28:30-07:00  2021-09-15 09:42:25-07:00  1                 False    False    0
92  2021-09-15  1     2021-09-15 09:52:18-07:00  2021-09-15 10:07:25-07:00  1                 True     True     1
93  2021-09-15  3     2021-09-15 10:02:05-07:00  2021-09-15 10:20:13-07:00  1                 False    True     1

Notice that row 90 is marked as 1, but my code fails to flag the row it overlaps with (row 91, which is marked 0). The overlap does not need to be total: even if it lasts just a minute, I still want to count it as an overlap. My code does not achieve this for every case in my dataset.

Any help or advice is dearly appreciated.

CodePudding user response:

The problem seems to boil down to finding overlapping intervals, where the intervals are defined by time_a and time_b.

This can be solved efficiently with the piso (pandas interval set operations) package, in particular its adjacency_matrix function:

import pandas as pd
import piso

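# Build one interval per row from the start and end columns
ii = pd.IntervalIndex.from_arrays(df["time_a"], df["time_b"])
# Boolean matrix marking which pairs of intervals intersect; a row overlaps if any entry in its matrix row is True
df["isOverlap"] = piso.adjacency_matrix(ii).any(axis=1).astype(int).values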
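If adding a dependency is not an option, the same check can be sketched with plain NumPy broadcasting. Two intervals overlap when each one starts before the other ends, which also covers the case where one interval lies entirely inside another (rows 90 and 91 in the question). The function name flag_overlaps and the small sample frame are illustrative assumptions, not part of the original code; the sample reuses a few rows from the question with the timezone offset dropped for brevity. This is a sketch that compares every pair of rows, so memory grows quadratically with the number of rows.

```python
import numpy as np
import pandas as pd


def flag_overlaps(df, start_col="time_a", end_col="time_b"):
    """Return a 0/1 array: 1 if the row's interval overlaps any other row's."""
    start = df[start_col].to_numpy()
    end = df[end_col].to_numpy()
    # Intervals i and j overlap when start_i < end_j and start_j < end_i.
    overlaps = (start[:, None] < end[None, :]) & (start[None, :] < end[:, None])
    # A row always "overlaps" itself, so ignore the diagonal.
    np.fill_diagonal(overlaps, False)
    return overlaps.any(axis=1).astype(int)


# Hypothetical sample based on rows 77, 90, 91, 92 and 93 from the question.
sample = pd.DataFrame(
    {
        "time_a": pd.to_datetime(
            [
                "2021-09-14 12:01:42",
                "2021-09-15 09:25:11",
                "2021-09-15 09:28:30",
                "2021-09-15 09:52:18",
                "2021-09-15 10:02:05",
            ]
        ),
        "time_b": pd.to_datetime(
            [
                "2021-09-14 12:12:20",
                "2021-09-15 09:53:33",
                "2021-09-15 09:42:25",
                "2021-09-15 10:07:25",
                "2021-09-15 10:20:13",
            ]
        ),
    },
    index=[77, 90, 91, 92, 93],
)

sample["isOverlap"] = flag_overlaps(sample)
result = sample["isOverlap"].tolist()
print(result)  # [0, 1, 1, 1, 1]
```

Here row 91 is flagged even though neither of its endpoints contains another row's start or end time, which is the case the endpoint-based check in the question misses.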