Extracting stages/segments from pandas dataframes and the corresponding time stamps

I have the following dataframe:

          Sleep Stage  Time[hh:mm:ss]  Event      Duration[s]
    0     SLEEP-S0     23:27:14        SLEEP-S0   30
    1     SLEEP-S0     23:27:44        SLEEP-S0   30
    2     SLEEP-MT     23:28:14        SLEEP-MT   30
    3     SLEEP-S0     23:28:44        SLEEP-S0   30
    4     SLEEP-S0     23:29:14        SLEEP-S0   30
    ...   ...          ...             ...        ...
    1022  SLEEP-REM    07:57:14        SLEEP-REM  30
    1023  SLEEP-REM    07:57:44        SLEEP-REM  30
    1024  SLEEP-S2     07:58:14        SLEEP-S2   30
    1025  SLEEP-S2     07:58:44        SLEEP-S2   30
    1026  SLEEP-S2     07:59:14        SLEEP-S2   30

I want to create segments that contain the stage, the start time of that stage, and the end time of that stage. Note that a stage may occur more than once. For example, I want something like this:

 STAGE        START_POINT        END_POINT      DURATION
SLEEP-S0       23:27:14          23:28:14        60
SLEEP-MT       23:28:14          23:28:44        30    
SLEEP-S0        ...                ...           ...

Can someone please help me with it?

CodePudding user response:

Your problem is a variation of the gaps-and-islands problem. Every time the stage changes from one row to the next, a new island begins, and the end of one island is the start of the next.
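To make the island labelling concrete, here is a minimal sketch on a made-up series (the short stage names are illustrative and not taken from the question). It shows how comparing each row with the previous one and taking a running sum gives every consecutive run its own label, even when a stage comes back later.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
stages = pd.Series(["S0", "S0", "MT", "S0", "S0", "S0"], name="Sleep Stage")

# True wherever the value differs from the previous row.
# The first row is compared with NaN, so it always counts as a change.
changed = stages.ne(stages.shift())

# A running count of the changes gives each consecutive run its own label
islands = changed.cumsum()

print(islands.tolist())  # [1, 1, 2, 3, 3, 3]
```

The second run of "S0" gets label 3 rather than 1, which is why grouping by the island label (and not by the stage alone) keeps repeated stages apart.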

import pandas as pd

# Every time the stage changes from the previous row, a new island begins
islands = df["Sleep Stage"].ne(df["Sleep Stage"].shift()).rename("island").cumsum()

# Parse the time column as datetimes for easier calculation. With an explicit
# format, every value lands on the same placeholder date (1900-01-01).
time = pd.to_datetime(df["Time[hh:mm:ss]"], format="%H:%M:%S")

# Left alone, "00:01:00" would fall on the same date as "23:27:14", but it most
# likely means 12:01 AM *the next day*. Shift times in the morning hours
# forward by 24 hours.
time = time.mask(time.dt.hour < 12, time + pd.Timedelta(hours=24))

# Each island has the same stage across all rows
result = time.groupby([islands, df["Sleep Stage"]]).agg(START_POINT="min")

# END_POINT is the START_POINT of the next island. The last island has no
# successor, so its END_POINT (and DURATION) stays missing.
result["END_POINT"] = result["START_POINT"].shift(-1)

# DURATION is measured in seconds
result["DURATION"] = (result["END_POINT"] - result["START_POINT"]) / pd.Timedelta(seconds=1)

# Format the START_POINT and END_POINT back to strings
for col in ["START_POINT", "END_POINT"]:
    result[col] = result[col].dt.strftime("%H:%M:%S")
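
For reference, here is a self-contained sketch that runs on the first five rows from the question. It is a variant of the approach above rather than the same code: instead of taking END_POINT from the next island, it computes each row's end as its time plus Duration[s], so the last segment also gets an end time. The function name segment_stages is hypothetical, and the sketch assumes the rows are contiguous and that any time before noon belongs to the following day.

```python
import pandas as pd

# First five rows of the dataframe from the question
df = pd.DataFrame({
    "Sleep Stage": ["SLEEP-S0", "SLEEP-S0", "SLEEP-MT", "SLEEP-S0", "SLEEP-S0"],
    "Time[hh:mm:ss]": ["23:27:14", "23:27:44", "23:28:14", "23:28:44", "23:29:14"],
    "Duration[s]": [30, 30, 30, 30, 30],
})


def segment_stages(df):
    """Collapse consecutive rows with the same stage into one segment each."""
    stage = df["Sleep Stage"]
    # New island whenever the stage differs from the previous row
    islands = stage.ne(stage.shift()).cumsum().rename("island")

    start = pd.to_datetime(df["Time[hh:mm:ss]"], format="%H:%M:%S")
    # Assumption: times before noon belong to the day after the recording began
    start = start.mask(start.dt.hour < 12, start + pd.Timedelta(hours=24))
    # Each row ends Duration[s] seconds after it starts
    end = start + pd.to_timedelta(df["Duration[s]"], unit="s")

    out = (
        pd.DataFrame({"STAGE": stage, "START_POINT": start, "END_POINT": end})
        .groupby(islands)
        .agg(
            STAGE=("STAGE", "first"),
            START_POINT=("START_POINT", "min"),
            END_POINT=("END_POINT", "max"),
        )
        .reset_index(drop=True)
    )
    out["DURATION"] = (
        (out["END_POINT"] - out["START_POINT"]).dt.total_seconds().astype(int)
    )
    # Format the timestamps back to plain time strings
    for col in ["START_POINT", "END_POINT"]:
        out[col] = out[col].dt.strftime("%H:%M:%S")
    return out


segments = segment_stages(df)
print(segments)
```

On this sample the first two segments come out as SLEEP-S0 from 23:27:14 to 23:28:14 (60 s) and SLEEP-MT from 23:28:14 to 23:28:44 (30 s), matching the output requested in the question.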