Pandas. Collapse rows with a certain time difference by summing data in other columns-CodePudding

I have a table with time points (timestamp). If less than a minute has passed between the points, I need to sum the timedelta and update the point's datetime boundaries.

For the rows where 'time_break' is less than 60 I need the following:

sum 'time_delta' with 'time_delta' from the previous row;
update 'ts_start' with the value 'ts_start' from the previous row;
calculate new 'time_break', so it's correct for updated time points

Input data:

df = pd.DataFrame({'ts_start': [1647854644, 1647855323, 1647855454, 1647855521, 1647858807, 1647858858, 1647858970],
                   'ts_end': [1647854699, 1647855421, 1647855521, 1647856205, 1647858810, 1647858958, 1647859020],
                   'time_break': [105.0, 624.0, 33.0, 0.0, 2602.0, 48.0, 12.0],
                   'time_delta': [55, 98, 67, 625, 3, 100, 50]})

Expected output:

df_out = pd.DataFrame({'ts_start': [1647854644, 1647855323, 1647858807],
                   'ts_end': [1647854699, 1647856205, 1647859020],
                   'time_break': [105.0, 624.0, 2602.0],
                   'time_delta': [55, 790, 153]})

But I understand it's working slowly and it misses doubled time_breaks. I think there's a possibility of using groupby, but I can't come up with a solution.

Would appreciate any help, thanks!

I've tried this:

for i in range(1, len(df)):
    try:
        if 0 <= df.iloc[i, df.columns.get_loc('time_break')] <= 60:
            df.iloc[i, df.columns.get_loc('time_delta')]  = df.iloc[i-1, df.columns.get_loc('time_delta')]
            df.iloc[i, df.columns.get_loc('ts_start')] = df.iloc[i-1, df.columns.get_loc('ts_start')]
            df.iloc[i, df.columns.get_loc('time_break')] = df.iloc[i, df.columns.get_loc('ts_start')] - df.iloc[i-2, df.columns.get_loc('ts_end')]
            df.drop(index=i-1, inplace=True)
            df = df.reset_index(drop=True)
    except IndexError:
        break;

My output: enter image description here

CodePudding user response：

Try:

x = (
    df.groupby((df["time_break"] >= 60).cumsum())
    .agg(
        {
            "ts_start": "first",
            "ts_end": "last",
            "time_break": "first",
            "time_delta": "sum",
        }
    )
    .reset_index(drop=True)
)

print(x)

Prints:

     ts_start      ts_end  time_break  time_delta
0  1647854644  1647854699       105.0          55
1  1647855323  1647856205       624.0         790
2  1647858807  1647859020      2602.0         153