I have cleaned up a data set to get it into this format. The assigned_pat_loc represents a room number, so I am trying to identify when two different patients (patient_id) are in the same room at the same time; i.e., overlapping start_time and end_time between rows with the same assigned_pat_loc but different patient_id's. The start_time and end_time represent the times that particular patient was in that room, so if those times overlap between two patients in the same room, it means they shared the room. This is what I'm ultimately looking for. Here is the base data set from which I want to construct these changes:
patient_id assigned_pat_loc start_time end_time
0 19035648 SICU^6108 2009-01-10 18:27:48 2009-02-25 15:45:54
1 19039244 85^8520 2009-01-02 06:27:25 2009-01-05 10:38:41
2 19039507 55^5514 2009-01-01 13:25:45 2009-01-01 13:25:45
3 19039555 EIAB^EIAB 2009-01-15 01:56:48 2009-02-23 11:36:34
4 19039559 EIAB^EIAB 2009-01-16 11:24:18 2009-01-19 18:41:33
... ... ... ... ...
140906 46851413 EIAB^EIAB 2011-12-31 22:28:38 2011-12-31 23:15:49
140907 46851422 EIAB^EIAB 2011-12-31 21:52:44 2011-12-31 22:50:08
140908 46851430 4LD^4LDX 2011-12-31 22:41:10 2011-12-31 22:44:48
140909 46851434 EIC^EIC 2011-12-31 23:45:22 2011-12-31 23:45:22
140910 46851437 EIAB^EIAB 2011-12-31 22:54:40 2011-12-31 23:30:10
I am thinking I should approach this with a groupby of some sort, but I'm not sure exactly how to implement it. I would show an attempt, but it took me about 6 hours to even get to this point, so I would appreciate even just some thoughts.
EDIT
Example of original data:
id Date Time assigned_pat_loc prior_pat_loc Activity
1 May/31/11 8:00 EIAB^EIAB^6 Admission
1 May/31/11 9:00 8w^201 EIAB^EIAB^6 Transfer
1 Jun/8/11 15:00 8w^201 Discharge
2 May/31/11 5:00 EIAB^EIAB^4 Admission
2 May/31/11 7:00 10E^45 EIAB^EIAB^4 Transfer
2 Jun/1/11 1:00 8w^201 10E^45 Transfer
2 Jun/1/11 8:00 8w^201 Discharge
3 May/31/11 9:00 EIAB^EIAB^2 Admission
3 Jun/1/11 9:00 8w^201 EIAB^EIAB^2 Transfer
3 Jun/5/11 9:00 8w^201 Discharge
4 May/31/11 9:00 EIAB^EIAB^9 Admission
4 May/31/11 7:00 10E^45 EIAB^EIAB^9 Transfer
4 Jun/1/11 8:00 10E^45 Death
Example of desired output:
id r_id start_date start_time end_date end_time length location
1 2 Jun/1/11 1:00 Jun/1/11 8:00 7 8w^201
1 3 Jun/1/11 9:00 Jun/5/11 9:00 96 8w^201
2 4 May/31/11 7:00 Jun/1/11 1:00 18 10E^45
2 1 Jun/1/11 1:00 Jun/1/11 8:00 7 8w^201
3 1 Jun/1/11 9:00 Jun/5/11 9:00 96 8w^201
Where r_id is the "other" patient who is sharing the same room as another one, and length is the number of hours that the room was shared.
In this example:
- r_id is the name of the variable you will generate for the id of the other patient.
- patient 1 had two room-sharing episodes, both in 8w^201 (room 201 of unit 8w); he shared the room with patient 2 for 7 hours (1 am to 8 am on June 1) and with patient 3 for 96 hours (9 am on June 1 to 9 am on June 5).
- Patient 2 also had two room-sharing episodes. The first one was with patient 4 in 10E^45 (room 45 of unit 10E) and lasted 18 hours (7 am May 31 to 1 am June 1); the second one is the 7-hour episode with patient 1 in 8w^201.
- Patient 3 had only one room-sharing episode with patient 1 in room 8w^201, lasting 96 hours.
- Patient 4 also had only one room-sharing episode, with patient 2 in room 10E^45, lasting 18 hours.
- Note: room-sharing episodes are listed twice, once for each patient. (The overlap test I have in mind is sketched just below.)
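To be clear about what I mean by "overlapping": it's the standard interval test, where two stays in the same room overlap when each one starts before the other one ends. A minimal illustration using the June 1 episode between patients 1 and 2 (timestamps taken from the example above; this is not an attempt at the full solution):
import pandas as pd

# Patient 2's stay in 8w^201 (Jun 1, 1:00-8:00) and patient 1's stay
# in 8w^201 (May 31, 9:00 - Jun 8, 15:00), taken from the example data.
a_start, a_end = pd.Timestamp("2011-06-01 01:00"), pd.Timestamp("2011-06-01 08:00")
b_start, b_end = pd.Timestamp("2011-05-31 09:00"), pd.Timestamp("2011-06-08 15:00")

# Each stay starts before the other ends, so the room was shared.
overlap = (a_start < b_end) and (b_start < a_end)  # True
shared_hours = (min(a_end, b_end) - max(a_start, b_start)).total_seconds() / 3600  # 7.0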
CodePudding user response:
numpy broadcasting is perfect for this. It allows you to compare every record (patient-room) against every other record in the dataframe. The downside is that it's memory intensive: the comparison matrix requires n^2 * 8 bytes, and with your ~141k rows that works out to about 148 GB of memory! We need to chunk the dataframe so the memory requirement is reduced to chunk_size * n * 8 bytes.
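For a rough sense of scale (my own back-of-the-envelope arithmetic, using the chunk_size = 1000 value from the code below):
n = 141_000                  # ~number of rows in the full data set
full_mask_bytes = n * n * 8  # ~1.6e11 bytes, i.e. the ~148 GB quoted above
chunk_bytes = 1_000 * n * 8  # ~1.1e9 bytes, i.e. roughly 1 GB per chunk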
import numpy as np
import pandas as pd

# Don't keep date and time separately; they are hard to
# perform calculations on. Instead, combine them into a
# single column and keep it as pd.Timestamp.
df["start_time"] = pd.to_datetime(df["Date"] + " " + df["Time"])

# I don't know how you determine when a patient vacates a
# room. My logic here is:
# - If Activity = Discharge or Death, end_time = start_time
# - Otherwise, end_time = start_time of the next room
# You can implement your own logic. This part is not
# essential to the problem at hand.
df["end_time"] = np.where(
    df["Activity"].isin(["Discharge", "Death"]),
    df["start_time"],
    df.groupby("id")["start_time"].shift(-1),
)

# ------------------------------------------------------------------------------
# Extract all the columns to numpy arrays
patient_id, assigned_pat_loc, start_time, end_time = (
    df[["id", "assigned_pat_loc", "start_time", "end_time"]].to_numpy().T
)

chunk_size = 1000  # experiment to find a size that suits you

idx_left = []
idx_right = []
for offset in range(0, len(df), chunk_size):
    chunk = slice(offset, offset + chunk_size)

    # Get a chunk of each array. The [:, None] part is to
    # raise the chunk up one dimension to prepare for numpy
    # broadcasting
    patient_id_chunk, assigned_pat_loc_chunk, start_time_chunk, end_time_chunk = [
        arr[chunk][:, None] for arr in (patient_id, assigned_pat_loc, start_time, end_time)
    ]

    # `mask` is a matrix. If mask[i, j] == True, the patient
    # in row i is sharing the room with the patient in row j
    mask = (
        # patient_id are different
        (patient_id_chunk != patient_id)
        # in the same room
        & (assigned_pat_loc_chunk == assigned_pat_loc)
        # start_time and end_time overlap
        & (start_time_chunk < end_time)
        & (start_time < end_time_chunk)
    )

    idx = mask.nonzero()
    idx_left.extend(idx[0] + offset)
    idx_right.extend(idx[1])

result = pd.concat(
    [
        df[["id", "assigned_pat_loc", "start_time", "end_time"]]
        .iloc[idx]
        .reset_index(drop=True)
        for idx in [idx_left, idx_right]
    ],
    axis=1,
    keys=["patient_1", "patient_2"],
)
Result:
patient_1 patient_2
id assigned_pat_loc start_time end_time id assigned_pat_loc start_time end_time
0 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 2 8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00
1 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 2 8w^201 2011-06-01 08:00:00 2011-06-01 08:00:00
2 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 3 8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00
3 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 3 8w^201 2011-06-05 09:00:00 2011-06-05 09:00:00
4 2 10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00 4 10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00
5 2 8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
6 2 8w^201 2011-06-01 08:00:00 2011-06-01 08:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
7 3 8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
8 3 8w^201 2011-06-05 09:00:00 2011-06-05 09:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
9 4 10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00 2 10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00
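The result keeps each patient's full interval. If you also want the shared window itself and a length in hours (as in the desired output), one possible follow-up sketch, assuming the MultiIndex column layout printed above:
import numpy as np
import pandas as pd

# Shared window = later of the two starts to earlier of the two ends.
shared = pd.DataFrame({
    "id": result[("patient_1", "id")],
    "r_id": result[("patient_2", "id")],
    "location": result[("patient_1", "assigned_pat_loc")],
    "start_time": np.maximum(result[("patient_1", "start_time")],
                             result[("patient_2", "start_time")]),
    "end_time": np.minimum(result[("patient_1", "end_time")],
                           result[("patient_2", "end_time")]),
})
# Length of the shared window in hours.
shared["length"] = (shared["end_time"] - shared["start_time"]).dt.total_seconds() / 3600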
CodePudding user response:
Another option.
I'm starting with the original data after the EDIT, but I have changed this row
4 May/31/11 9:00 EIAB^EIAB^9 Admission
to
4 May/31/11 6:00 EIAB^EIAB^9 Admission
because I think the admission time should be before the transfer time?
The first step is essentially to get a dataframe similar to the one you're starting out with:
import pandas as pd

df = (
    df.assign(start_time=pd.to_datetime(df["Date"] + " " + df["Time"]))
    .sort_values(["id", "start_time"])
    .assign(duration=lambda df: -df.groupby("id")["start_time"].diff(-1))
    .loc[lambda df: df["duration"].notna()]
    .assign(end_time=lambda df: df["start_time"] + df["duration"])
    .rename(columns={"assigned_pat_loc": "location"})
    [["id", "location", "start_time", "end_time"]]
)
Result for the sample:
id location start_time end_time
0 1 EIAB^EIAB^6 2011-05-31 08:00:00 2011-05-31 09:00:00
1 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
3 2 EIAB^EIAB^4 2011-05-31 05:00:00 2011-05-31 07:00:00
4 2 10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00
5 2 8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00
7 3 EIAB^EIAB^2 2011-05-31 09:00:00 2011-06-01 09:00:00
8 3 8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00
10 4 EIAB^EIAB^9 2011-05-31 06:00:00 2011-05-31 07:00:00
11 4 10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00
The next step is merging df with itself on the location column and eliminating the rows where id is the same as r_id:
df = (
    df.merge(df, on="location")
    .rename(columns={"id_x": "id", "id_y": "r_id"})
    .loc[lambda df: df["id"] != df["r_id"]]
)
Then finally get the rows with an actual overlap via m, calculate the duration of the overlap, and bring the dataframe into the form you are looking for:
m = (
    (df["start_time_x"].le(df["start_time_y"])
     & df["start_time_y"].le(df["end_time_x"]))
    | (df["start_time_y"].le(df["start_time_x"])
       & df["start_time_x"].le(df["end_time_y"]))
)

df = (
    df[m]
    .assign(
        start_time=lambda df: df[["start_time_x", "start_time_y"]].max(axis=1),
        end_time=lambda df: df[["end_time_x", "end_time_y"]].min(axis=1),
        duration=lambda df: df["end_time"] - df["start_time"]
    )
    .assign(
        start_date=lambda df: df["start_time"].dt.date,
        start_time=lambda df: df["start_time"].dt.time,
        end_date=lambda df: df["end_time"].dt.date,
        end_time=lambda df: df["end_time"].dt.time
    )
    [[
        "id", "r_id",
        "start_date", "start_time", "end_date", "end_time",
        "duration", "location"
    ]]
    .sort_values(["id", "r_id"]).reset_index(drop=True)
)
Result for the sample:
   id  r_id  start_date start_time    end_date  end_time        duration location
0   1     2  2011-06-01   01:00:00  2011-06-01  08:00:00 0 days 07:00:00   8w^201
1   1     3  2011-06-01   09:00:00  2011-06-05  09:00:00 4 days 00:00:00   8w^201
2   2     1  2011-06-01   01:00:00  2011-06-01  08:00:00 0 days 07:00:00   8w^201
3   2     4  2011-05-31   07:00:00  2011-06-01  01:00:00 0 days 18:00:00  10E^45
4   3     1  2011-06-01   09:00:00  2011-06-05  09:00:00 4 days 00:00:00   8w^201
5   4     2  2011-05-31   07:00:00  2011-06-01  01:00:00 0 days 18:00:00  10E^45
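If you prefer length as a number of hours, as in the desired output, a small optional addition on top of this result:
# Express the shared time in hours instead of a timedelta.
df["length"] = df["duration"].dt.total_seconds() / 3600
df = df.drop(columns="duration")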