How to calculate the duration between rows with the same stage value and then get the cumulative duration

I have the following dataframe:

dt_datetime        stage    proc_val
2011-11-13 11:00   0        20
2011-11-13 11:10   0        21
2011-11-13 11:30   1        25
2011-11-13 11:40   2        22
2011-11-13 11:55   2        28
2011-11-13 12:00   2        29

I need to add a new column called stage_duration and get the following result:

dt_datetime        stage    proc_val   stage_duration
2011-11-13 11:00   0        20         30
2011-11-13 11:10   0        21         30
2011-11-13 11:30   1        25         10
2011-11-13 11:40   2        22         20
2011-11-13 11:55   2        28         20
2011-11-13 12:00   2        29         20

How can I do it?

This is my current code, but it does not produce the expected result. It is meant to calculate the duration between rows with the same stage value and then accumulate the duration of each stage:

df['stage_duration'] = df.groupby('stage')['dt_datetime'].diff().dt.total_seconds() / 60
df['stage_duration'] = df['stage_duration'].cumsum()
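
A minimal reproduction on the sample data above shows where the snippet goes wrong (the dataframe is transcribed from the table; the names gaps and running are only illustrative). The grouped diff measures gaps between rows inside a stage only, so the time from the last row of one stage to the first row of the next is never counted, and the plain cumsum runs over the whole column instead of restarting per stage.

```python
import pandas as pd

# Sample data transcribed from the first table in the question
df = pd.DataFrame({
    "dt_datetime": pd.to_datetime([
        "2011-11-13 11:00", "2011-11-13 11:10", "2011-11-13 11:30",
        "2011-11-13 11:40", "2011-11-13 11:55", "2011-11-13 12:00",
    ]),
    "stage": [0, 0, 1, 2, 2, 2],
    "proc_val": [20, 21, 25, 22, 28, 29],
})

# Gap in minutes to the previous row of the same stage
# (NaN for the first row of each stage, which has no predecessor)
gaps = df.groupby("stage")["dt_datetime"].diff().dt.total_seconds() / 60

# cumsum is applied to the whole column, not per stage
running = gaps.cumsum()

print(pd.DataFrame({"stage": df["stage"], "gap": gaps, "running": running}))
```

The running column comes out as NaN, 10, NaN, NaN, 25, 30 rather than the expected 30, 30, 10, 20, 20, 20.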

Update:

The solution should also work if the same stage value occurs in several separate runs. For example, stage 0 below starts at 2011-11-13 11:00 and again at 2011-11-13 12:00, and the two runs have different durations (30 and 70 minutes).

dt_datetime        stage    proc_val   stage_duration
2011-11-13 11:00   0        20         30
2011-11-13 11:10   0        21         30
2011-11-13 11:30   1        25         10
2011-11-13 11:40   2        22         20
2011-11-13 11:55   2        28         20
2011-11-13 12:00   2        29         20
2011-11-13 12:00   0        20         70
2011-11-13 13:10   0        21         70

CodePudding user response:

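One option, assuming each stage value occurs in a single run and the rows are sorted by time:

import pandas as pd

# ensure datetime
df['dt_datetime'] = pd.to_datetime(df['dt_datetime'])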

# get min per group
s = df.groupby('stage')['dt_datetime'].min()

# add last date
s['last'] = df['dt_datetime'].max()

# compute delta and map
df['stage_duration'] = df['stage'].map(s.diff().shift(-1)
                                        .dt.total_seconds().div(60))

Output:

          dt_datetime  stage  proc_val  stage_duration
0 2011-11-13 11:00:00      0        20            30.0
1 2011-11-13 11:10:00      0        21            30.0
2 2011-11-13 11:30:00      1        25            10.0
3 2011-11-13 11:40:00      2        22            20.0
4 2011-11-13 11:55:00      2        28            20.0
5 2011-11-13 12:00:00      2        29            20.0

For successive groups (the same stage value may occur in several separate runs):

# ensure datetime
df['dt_datetime'] = pd.to_datetime(df['dt_datetime'])

# group by successive values
group = df['stage'].ne(df['stage'].shift()).cumsum()

# get min per group
s = df.groupby(group)['dt_datetime'].min()
s['last'] = df['dt_datetime'].max()

# compute delta and map
df['stage_duration'] = group.map(s.diff().shift(-1).dt.total_seconds().div(60))

Output:

          dt_datetime  stage  proc_val  stage_duration
0 2011-11-13 11:00:00      0        20            30.0
1 2011-11-13 11:10:00      0        21            30.0
2 2011-11-13 11:30:00      1        25            10.0
3 2011-11-13 11:40:00      2        22            20.0
4 2011-11-13 11:55:00      2        28            20.0
5 2011-11-13 12:00:00      2        29            20.0
6 2011-11-13 12:00:00      0        20            70.0
7 2011-11-13 13:10:00      0        21            70.0
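
The same idea can be packaged as a self-contained sketch that runs on the data from the question's update. The helper name add_stage_duration and its parameters are hypothetical, and it assumes the rows are already sorted by time. Instead of appending a 'last' label to the series of start times, it shifts the starts to get each run's end and fills the final end with the last timestamp.

```python
import pandas as pd


def add_stage_duration(df, time_col="dt_datetime", stage_col="stage"):
    """Return a copy of df with a stage_duration column in minutes.

    Illustrative helper (not part of pandas); assumes rows are sorted by time_col.
    """
    out = df.copy()
    out[time_col] = pd.to_datetime(out[time_col])

    # New run id every time the stage value changes from one row to the next
    run_id = out[stage_col].ne(out[stage_col].shift()).cumsum()

    # First timestamp of each run
    starts = out.groupby(run_id)[time_col].min()

    # A run ends where the next one begins; the final run ends at the last timestamp
    ends = starts.shift(-1).fillna(out[time_col].max())

    # Duration of each run in minutes, broadcast back onto the rows
    minutes = (ends - starts).dt.total_seconds().div(60)
    out["stage_duration"] = run_id.map(minutes)
    return out


# Data transcribed from the question's update
df = pd.DataFrame({
    "dt_datetime": [
        "2011-11-13 11:00", "2011-11-13 11:10", "2011-11-13 11:30",
        "2011-11-13 11:40", "2011-11-13 11:55", "2011-11-13 12:00",
        "2011-11-13 12:00", "2011-11-13 13:10",
    ],
    "stage": [0, 0, 1, 2, 2, 2, 0, 0],
    "proc_val": [20, 21, 25, 22, 28, 29, 20, 21],
})

result = add_stage_duration(df)
print(result)
```

Returning a copy leaves the caller's dataframe untouched, which makes it easier to compare the input and the result side by side.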

CodePudding user response:

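Assuming the dataframe has a stage column and a datetime column, you can loop over the stage values, take the difference between consecutive rows of each stage, and accumulate it. Note that this gives the running time elapsed within each stage rather than the single per-stage total shown in the question, and it treats all rows with the same stage value as one group even if they occur in separate runs: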

import pandas as pd

# Ensure the time column is a datetime (column name taken from the question)
df['dt_datetime'] = pd.to_datetime(df['dt_datetime'])

# Collect the per-stage results here
parts = []

# Iterate through each stage
for stage in df['stage'].unique():
    # Get the rows with the same stage (copy to avoid SettingWithCopyWarning)
    stage_df = df[df['stage'] == stage].copy()
    # Duration between consecutive rows in minutes
    # (the first row of a stage has no predecessor, so it stays NaN)
    stage_df['duration'] = stage_df['dt_datetime'].diff().dt.total_seconds().div(60)
    # Cumulative duration within the stage
    stage_df['duration'] = stage_df['duration'].cumsum()
    parts.append(stage_df[['stage', 'duration']])

# DataFrame.append was removed in pandas 2.0, so combine with pd.concat
results_df = pd.concat(parts)

# Print the results
print(results_df)
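
The loop above can also be written without explicit iteration. This is a minimal sketch, assuming the question's column names (stage, dt_datetime); elapsed_in_stage is an illustrative column name. Like the loop, it yields the running time within each stage value, not the per-stage total asked for in the question, and it fills the first row of each stage with 0 instead of leaving it missing.

```python
import pandas as pd

# Sample data transcribed from the first table in the question
df = pd.DataFrame({
    "dt_datetime": pd.to_datetime([
        "2011-11-13 11:00", "2011-11-13 11:10", "2011-11-13 11:30",
        "2011-11-13 11:40", "2011-11-13 11:55", "2011-11-13 12:00",
    ]),
    "stage": [0, 0, 1, 2, 2, 2],
    "proc_val": [20, 21, 25, 22, 28, 29],
})

# Minutes since the previous row of the same stage (0 for the first row of a stage)
step = (
    df.groupby("stage")["dt_datetime"]
    .diff()
    .dt.total_seconds()
    .div(60)
    .fillna(0)
)

# Running total of those steps, restarted for every stage value
df["elapsed_in_stage"] = step.groupby(df["stage"]).cumsum()

print(df)
```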