I have the following dataframe:
     Sleep Stage  Time[hh:mm:ss]      Event  Duration[s]
0       SLEEP-S0        23:27:14   SLEEP-S0           30
1       SLEEP-S0        23:27:44   SLEEP-S0           30
2       SLEEP-MT        23:28:14   SLEEP-MT           30
3       SLEEP-S0        23:28:44   SLEEP-S0           30
4       SLEEP-S0        23:29:14   SLEEP-S0           30
...          ...             ...        ...          ...
1022   SLEEP-REM        07:57:14  SLEEP-REM           30
1023   SLEEP-REM        07:57:44  SLEEP-REM           30
1024    SLEEP-S2        07:58:14   SLEEP-S2           30
1025    SLEEP-S2        07:58:44   SLEEP-S2           30
1026    SLEEP-S2        07:59:14   SLEEP-S2           30
I want to create segments that contain the stage, the start time of that stage, and the end time of that stage. Note that a stage may occur more than once during the night. For example, I want something like this:
STAGE     START_POINT  END_POINT  DURATION
SLEEP-S0  23:27:14     23:28:14   60
SLEEP-MT  23:28:14     23:28:44   30
SLEEP-S0  ...          ...        ...
Can someone please help me with it?
CodePudding user response:
Your problem is a variation of the gaps-and-islands problem. Every time the Stage
changes, a new island starts, and the end of one island is the start of the next one.
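import pandas as pd

# Assumes the "Sleep Stage" and "Time[hh:mm:ss]" columns have been renamed to "Stage" and "Time"
# Every time the Stage changes from the previous row, it creates a new island
islands = df["Stage"].ne(df["Stage"].shift()).rename("island").cumsum()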
# Convert the Time column to time for easier calculation
time = pd.to_datetime(df["Time"])
# If left alone, "00:01:00" is parsed as 12:01 AM on the same date as the evening
# rows, but it most likely means 12:01 AM *the next day*. Times in the wee hours
# (before noon) therefore need special handling: push them forward by 24 hours.
noon = time.dt.normalize() + pd.Timedelta(hours=12)
time = time.mask(time < noon, time + pd.Timedelta(hours=24))
# Each island has the same Stage across all rows
result = time.groupby([islands, df["Stage"]]).agg(START_POINT="min")
# END_POINT is START_POINT of the next stage
result["END_POINT"] = result["START_POINT"].shift(-1)
# DURATION is measured in seconds
result["DURATION"] = (result["END_POINT"] - result["START_POINT"]) / pd.Timedelta(seconds=1)
# Format the START_POINT and END_POINT back to strings
for col in ["START_POINT", "END_POINT"]:
    result[col] = result[col].dt.strftime("%H:%M:%S")
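To check the approach end to end, here is a minimal self-contained sketch that runs the same steps on the first five rows from the question. It makes a few assumptions that are not in the original: the columns are named "Stage" and "Time", the times are parsed against an arbitrary fixed date (2000-01-01) so the result does not depend on when the script runs, and the Stage is moved out of the index into a STAGE column to resemble the requested layout.

```python
import pandas as pd

# Sample rows copied from the question; the short column names are an assumption
df = pd.DataFrame({
    "Stage": ["SLEEP-S0", "SLEEP-S0", "SLEEP-MT", "SLEEP-S0", "SLEEP-S0"],
    "Time": ["23:27:14", "23:27:44", "23:28:14", "23:28:44", "23:29:14"],
})

# Number each run of identical consecutive stages: 1, 1, 2, 3, 3
islands = df["Stage"].ne(df["Stage"].shift()).cumsum().rename("island")

# Parse against a fixed, arbitrary date (an assumption for reproducibility)
time = pd.to_datetime("2000-01-01 " + df["Time"])
# Times before noon are assumed to belong to the following day
noon = pd.Timestamp("2000-01-01 12:00:00")
time = time.mask(time < noon, time + pd.Timedelta(days=1))

# One row per island; move Stage from the index into a STAGE column
result = (
    time.groupby([islands, df["Stage"]])
    .agg(START_POINT="min")
    .reset_index(level="Stage")
    .rename(columns={"Stage": "STAGE"})
)
# Each island ends where the next one starts; the last island has no end
result["END_POINT"] = result["START_POINT"].shift(-1)
result["DURATION"] = (result["END_POINT"] - result["START_POINT"]).dt.total_seconds()
for col in ["START_POINT", "END_POINT"]:
    result[col] = result[col].dt.strftime("%H:%M:%S")

print(result)
```

The first two rows reproduce the segments asked for in the question (SLEEP-S0 from 23:27:14 to 23:28:14 lasting 60 seconds, then SLEEP-MT from 23:28:14 to 23:28:44 lasting 30 seconds). The last island has no following stage, so its END_POINT and DURATION are missing; if needed, they could be filled from the final row's time plus its 30-second duration.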