I have cleaned up a data set to get it into this format. The assigned_pat_loc represents a room number, so I am trying to identify when two different patients (patient_id) are in the same room at the same time; i.e., overlapping start_time and end_time between rows with the same assigned_pat_loc but different patient_id's. The start_time and end_time represent the times that particular patient was in that room, so if those times overlap between two patients in the same room, it means they shared the room. This is what I'm ultimately looking for. Here is the base data set from which I want to construct these changes:
patient_id assigned_pat_loc start_time end_time
0 19035648 SICU^6108 2009-01-10 18:27:48 2009-02-25 15:45:54
1 19039244 85^8520 2009-01-02 06:27:25 2009-01-05 10:38:41
2 19039507 55^5514 2009-01-01 13:25:45 2009-01-01 13:25:45
3 19039555 EIAB^EIAB 2009-01-15 01:56:48 2009-02-23 11:36:34
4 19039559 EIAB^EIAB 2009-01-16 11:24:18 2009-01-19 18:41:33
... ... ... ... ...
140906 46851413 EIAB^EIAB 2011-12-31 22:28:38 2011-12-31 23:15:49
140907 46851422 EIAB^EIAB 2011-12-31 21:52:44 2011-12-31 22:50:08
140908 46851430 4LD^4LDX 2011-12-31 22:41:10 2011-12-31 22:44:48
140909 46851434 EIC^EIC 2011-12-31 23:45:22 2011-12-31 23:45:22
140910 46851437 EIAB^EIAB 2011-12-31 22:54:40 2011-12-31 23:30:10
I am thinking I should approach this with a groupby of some sort, but I'm not sure exactly how to implement it. I would show an attempt, but it took me about 6 hours to even get to this point, so I would appreciate even just some thoughts.
EDIT
Example of original data:
id Date Time assigned_pat_loc prior_pat_loc Activity
1 May/31/11 8:00 EIAB^EIAB^6 Admission
1 May/31/11 9:00 8w^201 EIAB^EIAB^6 Transfer
1 Jun/8/11 15:00 8w^201 Discharge
2 May/31/11 5:00 EIAB^EIAB^4 Admission
2 May/31/11 7:00 10E^45 EIAB^EIAB^4 Transfer
2 Jun/1/11 1:00 8w^201 10E^45 Transfer
2 Jun/1/11 8:00 8w^201 Discharge
3 May/31/11 9:00 EIAB^EIAB^2 Admission
3 Jun/1/11 9:00 8w^201 EIAB^EIAB^2 Transfer
3 Jun/5/11 9:00 8w^201 Discharge
4 May/31/11 9:00 EIAB^EIAB^9 Admission
4 May/31/11 7:00 10E^45 EIAB^EIAB^9 Transfer
4 Jun/1/11 8:00 10E^45 Death
Example of desired output:
id r_id start_date start_time end_date end_time length location
1 2 Jun/1/11 1:00 Jun/1/11 8:00 7 8w^201
1 3 Jun/1/11 9:00 Jun/5/11 9:00 96 8w^201
2 4 May/31/11 7:00 Jun/1/11 1:00 18 10E^45
2 1 Jun/1/11 1:00 Jun/1/11 8:00 7 8w^201
3 1 Jun/1/11 9:00 Jun/5/11 9:00 96 8w^201
Where r_id is the "other" patient who is sharing the same room as another one, and length is the number of hours that the room was shared.
In this example:
- r_id is the name of the variable you will generate for the id of the other patient.
- patient 1 had two room-sharing episodes, both in 8w^201 (room 201 of unit 8w); he shared the room with patient 2 for 7 hours (1 am to 8 am on June 1) and with patient 3 for 96 hours (9 am on June 1 to 9 am on June 5).
- Patient 2 also had two room-sharing episodes. The first one was with patient 4 in 10E^45 (room 45 of unit 10E) and lasted 18 hours (7 am May 31 to 1 am June 1); the second one is the 7-hour episode with patient 1 in 8w^201.
- Patient 3 had only one room-sharing episode with patient 1 in room 8w^201, lasting 96 hours.
- Patient 4 also had only one room-sharing episode, with patient 2 in room 10E^45, lasting 18 hours.
- Note: room-sharing episodes are listed twice, once for each patient. (The overlap test I have in mind is sketched just below.)
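To be clear about what I mean by "overlapping": it's the standard interval test, where two stays in the same room overlap when each one starts before the other one ends. A minimal illustration using the June 1 episode between patients 1 and 2 (timestamps taken from the example above; this is not an attempt at the full solution):
import pandas as pd

# Patient 2's stay in 8w^201 (Jun 1, 1:00-8:00) and patient 1's stay
# in 8w^201 (May 31, 9:00 - Jun 8, 15:00), taken from the example data.
a_start, a_end = pd.Timestamp("2011-06-01 01:00"), pd.Timestamp("2011-06-01 08:00")
b_start, b_end = pd.Timestamp("2011-05-31 09:00"), pd.Timestamp("2011-06-08 15:00")

# Each stay starts before the other ends, so the room was shared.
overlap = (a_start < b_end) and (b_start < a_end)  # True
shared_hours = (min(a_end, b_end) - max(a_start, b_start)).total_seconds() / 3600  # 7.0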
CodePudding user response:
numpy broadcasting is perfect for this. It allows you to compare every record (patient-room) against every other record in the dataframe. The downside is that it's memory intensive: the comparison matrix requires n^2 * 8 bytes, and with your ~141k rows that works out to about 148 GB of memory! We need to chunk the dataframe so the memory requirement is reduced to chunk_size * n * 8 bytes.
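For a rough sense of scale (my own back-of-the-envelope arithmetic, using the chunk_size = 1000 value from the code below):
n = 141_000                  # ~number of rows in the full data set
full_mask_bytes = n * n * 8  # ~1.6e11 bytes, i.e. the ~148 GB quoted above
chunk_bytes = 1_000 * n * 8  # ~1.1e9 bytes, i.e. roughly 1 GB per chunk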
import numpy as np
import pandas as pd

# Don't keep date and time separately; they are hard to
# perform calculations on. Instead, combine them into a
# single column and keep it as pd.Timestamp.
df["start_time"] = pd.to_datetime(df["Date"] + " " + df["Time"])

# I don't know how you determine when a patient vacates a
# room. My logic here is:
# - If Activity = Discharge or Death, end_time = start_time
# - Otherwise, end_time = start_time of the next room
# You can implement your own logic. This part is not
# essential to the problem at hand.
df["end_time"] = np.where(
    df["Activity"].isin(["Discharge", "Death"]),
    df["start_time"],
    df.groupby("id")["start_time"].shift(-1),
)

# ------------------------------------------------------------------------------
# Extract all the columns to numpy arrays
patient_id, assigned_pat_loc, start_time, end_time = (
    df[["id", "assigned_pat_loc", "start_time", "end_time"]].to_numpy().T
)

chunk_size = 1000  # experiment to find a size that suits you

idx_left = []
idx_right = []
for offset in range(0, len(df), chunk_size):
    chunk = slice(offset, offset + chunk_size)

    # Get a chunk of each array. The [:, None] part is to
    # raise the chunk up one dimension to prepare for numpy
    # broadcasting
    patient_id_chunk, assigned_pat_loc_chunk, start_time_chunk, end_time_chunk = [
        arr[chunk][:, None] for arr in (patient_id, assigned_pat_loc, start_time, end_time)
    ]

    # `mask` is a matrix. If mask[i, j] == True, the patient
    # in row i is sharing the room with the patient in row j
    mask = (
        # patient_id are different
        (patient_id_chunk != patient_id)
        # in the same room
        & (assigned_pat_loc_chunk == assigned_pat_loc)
        # start_time and end_time overlap
        & (start_time_chunk < end_time)
        & (start_time < end_time_chunk)
    )

    idx = mask.nonzero()
    idx_left.extend(idx[0] + offset)
    idx_right.extend(idx[1])

result = pd.concat(
    [
        df[["id", "assigned_pat_loc", "start_time", "end_time"]]
        .iloc[idx]
        .reset_index(drop=True)
        for idx in [idx_left, idx_right]
    ],
    axis=1,
    keys=["patient_1", "patient_2"],
)
Result:
patient_1 patient_2
id assigned_pat_loc start_time end_time id assigned_pat_loc start_time end_time
0 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 2 8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00
1 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 2 8w^201 2011-06-01 08:00:00 2011-06-01 08:00:00
2 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 3 8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00
3 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00 3 8w^201 2011-06-05 09:00:00 2011-06-05 09:00:00
4 2 10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00 4 10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00
5 2 8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
6 2 8w^201 2011-06-01 08:00:00 2011-06-01 08:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
7 3 8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
8 3 8w^201 2011-06-05 09:00:00 2011-06-05 09:00:00 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
9 4 10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00 2 10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00
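The result keeps each patient's full interval. If you also want the shared window itself and a length in hours (as in the desired output), one possible follow-up sketch, assuming the MultiIndex column layout printed above:
import numpy as np
import pandas as pd

# Shared window = later of the two starts to earlier of the two ends.
shared = pd.DataFrame({
    "id": result[("patient_1", "id")],
    "r_id": result[("patient_2", "id")],
    "location": result[("patient_1", "assigned_pat_loc")],
    "start_time": np.maximum(result[("patient_1", "start_time")],
                             result[("patient_2", "start_time")]),
    "end_time": np.minimum(result[("patient_1", "end_time")],
                           result[("patient_2", "end_time")]),
})
# Length of the shared window in hours.
shared["length"] = (shared["end_time"] - shared["start_time"]).dt.total_seconds() / 3600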
CodePudding user response:
Another option.
I'm starting with the original data after the EDIT, but I have changed this row
4 May/31/11 9:00 EIAB^EIAB^9 Admission
to
4 May/31/11 6:00 EIAB^EIAB^9 Admission
because I think the admission time should be before the transfer time?
The first step is essentially to get a dataframe similar to the one you're starting out with:
import pandas as pd

df = (
    df.assign(start_time=pd.to_datetime(df["Date"] + " " + df["Time"]))
    .sort_values(["id", "start_time"])
    .assign(duration=lambda df: -df.groupby("id")["start_time"].diff(-1))
    .loc[lambda df: df["duration"].notna()]
    .assign(end_time=lambda df: df["start_time"] + df["duration"])
    .rename(columns={"assigned_pat_loc": "location"})
    [["id", "location", "start_time", "end_time"]]
)
Result for the sample:
id location start_time end_time
0 1 EIAB^EIAB^6 2011-05-31 08:00:00 2011-05-31 09:00:00
1 1 8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
3 2 EIAB^EIAB^4 2011-05-31 05:00:00 2011-05-31 07:00:00
4 2 10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00
5 2 8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00
7 3 EIAB^EIAB^2 2011-05-31 09:00:00 2011-06-01 09:00:00
8 3 8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00
10 4 EIAB^EIAB^9 2011-05-31 06:00:00 2011-05-31 07:00:00
11 4 10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00
The next step is merging df with itself on the location column and eliminating the rows where id is the same as r_id:
df = (
    df.merge(df, on="location")
    .rename(columns={"id_x": "id", "id_y": "r_id"})
    .loc[lambda df: df["id"] != df["r_id"]]
)
Then finally get the rows with an actual overlap via m, calculate the duration of the overlap, and bring the dataframe into the form you are looking for:
m = (
    (df["start_time_x"].le(df["start_time_y"])
     & df["start_time_y"].le(df["end_time_x"]))
    | (df["start_time_y"].le(df["start_time_x"])
       & df["start_time_x"].le(df["end_time_y"]))
)

df = (
    df[m]
    .assign(
        start_time=lambda df: df[["start_time_x", "start_time_y"]].max(axis=1),
        end_time=lambda df: df[["end_time_x", "end_time_y"]].min(axis=1),
        duration=lambda df: df["end_time"] - df["start_time"]
    )
    .assign(
        start_date=lambda df: df["start_time"].dt.date,
        start_time=lambda df: df["start_time"].dt.time,
        end_date=lambda df: df["end_time"].dt.date,
        end_time=lambda df: df["end_time"].dt.time
    )
    [[
        "id", "r_id",
        "start_date", "start_time", "end_date", "end_time",
        "duration", "location"
    ]]
    .sort_values(["id", "r_id"]).reset_index(drop=True)
)
Result for the sample:
   id  r_id  start_date start_time    end_date  end_time        duration location
0   1     2  2011-06-01   01:00:00  2011-06-01  08:00:00 0 days 07:00:00   8w^201
1   1     3  2011-06-01   09:00:00  2011-06-05  09:00:00 4 days 00:00:00   8w^201
2   2     1  2011-06-01   01:00:00  2011-06-01  08:00:00 0 days 07:00:00   8w^201
3   2     4  2011-05-31   07:00:00  2011-06-01  01:00:00 0 days 18:00:00  10E^45
4   3     1  2011-06-01   09:00:00  2011-06-05  09:00:00 4 days 00:00:00   8w^201
5   4     2  2011-05-31   07:00:00  2011-06-01  01:00:00 0 days 18:00:00  10E^45
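If you prefer length as a number of hours, as in the desired output, a small optional addition on top of this result:
# Express the shared time in hours instead of a timedelta.
df["length"] = df["duration"].dt.total_seconds() / 3600
df = df.drop(columns="duration")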