Pandas - How to group sequences

How can I group a pandas (or dask) DataFrame into runs of rows where each row is exactly 1 second after the previous one for the same ID, and get the min, the max, and some aggregate (here a sum) for each run?

MY DATA:

ID	DT	VALOR
1	12:01:00	7
1	12:01:01	1
1	12:01:02	4
1	12:01:03	3
1	12:01:08	1
1	12:01:09	5
2	12:01:09	6
1	12:01:10	6
1	12:01:11	4

RETURN:

ID	MENOR_DT	MAIOR_DT	SOMA
1	12:01:00	12:01:03	15
1	12:01:08	12:01:11	16
2	12:01:09	12:01:09	6
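For anyone who wants to reproduce this, the sample can be built roughly as follows. This is only a sketch: keeping DT as plain strings and the default integer index are assumptions, since the post does not say how the data is loaded.

```python
import pandas as pd

# Sample rows copied from the MY DATA table; DT is kept as plain strings here
# (an assumption, the real data may already hold time values).
df = pd.DataFrame(
    {
        "ID": [1, 1, 1, 1, 1, 1, 2, 1, 1],
        "DT": [
            "12:01:00", "12:01:01", "12:01:02", "12:01:03", "12:01:08",
            "12:01:09", "12:01:09", "12:01:10", "12:01:11",
        ],
        "VALOR": [7, 1, 4, 3, 1, 5, 6, 6, 4],
    }
)
print(df)
```

Note that the single ID 2 row sits between two ID 1 rows, so any solution has to compare each row with the previous row of the same ID, not simply with the row above it.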

CodePudding user response:

Try:

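import pandas as pd

# df is the frame from the question; rows are assumed to be in time order within each ID
df["DT"] = pd.to_timedelta(df["DT"])

# label the runs inside each ID: the counter goes up whenever the gap to the
# previous row is not 1 second (bfill gives each ID's first row the gap of its second row)
tmp = df.groupby("ID", group_keys=False)["DT"].apply(
    lambda x: (x.diff().bfill() != pd.Timedelta(seconds=1)).cumsum()
)

# one output row per (ID, run label)
df = (
    df.groupby(["ID", tmp])
    .agg(
        ID=("ID", "first"),
        MENOR_DT=("DT", "min"),
        MAIOR_DT=("DT", "max"),
        SOMA=("VALOR", "sum"),
    )
    .reset_index(drop=True)
)
# turn "0 days 12:01:00" back into "12:01:00"
df["MENOR_DT"] = df["MENOR_DT"].astype(str).str.split().str[-1]
df["MAIOR_DT"] = df["MAIOR_DT"].astype(str).str.split().str[-1]
print(df)

Prints:

   ID  MENOR_DT  MAIOR_DT  SOMA
0   1  12:01:00  12:01:03    15
1   1  12:01:08  12:01:11    16
2   2  12:01:09  12:01:09     6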
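To see how the run labels come about, it may help to look at the intermediate columns one step at a time. The sketch below is a slightly different spelling of the same diff-and-cumsum idea, not the exact code above: the names gap, new_run and run_id are made up for illustration, and it rebuilds the question's sample so it runs on its own.

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame(
    {
        "ID": [1, 1, 1, 1, 1, 1, 2, 1, 1],
        "DT": [
            "12:01:00", "12:01:01", "12:01:02", "12:01:03", "12:01:08",
            "12:01:09", "12:01:09", "12:01:10", "12:01:11",
        ],
        "VALOR": [7, 1, 4, 3, 1, 5, 6, 6, 4],
    }
)
dt = pd.to_timedelta(df["DT"])

# Gap to the previous row of the same ID; NaT on each ID's first row.
gap = dt.groupby(df["ID"]).diff()

# True wherever a new run starts: the first row of an ID (NaT compares as
# "not equal") or any row whose gap is not exactly 1 second.
new_run = gap != pd.Timedelta(seconds=1)

# Counting the run starts within each ID gives every row a run label.
run_id = new_run.astype(int).groupby(df["ID"]).cumsum()

print(df.assign(gap=gap, new_run=new_run, run_id=run_id))
```

Grouping by ID together with run_id then yields one group per run. The per-ID cumsum matters here because the ID 2 row is interleaved with the ID 1 rows; a single cumsum over the whole unsorted frame would give the last two ID 1 rows a different label from the two rows before them.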

CodePudding user response:

import pandas as pd

# sort so that each ID's rows are adjacent and in time order
df = df.sort_values(['ID', 'DT'])

# seconds elapsed since the previous row of the same ID (NaN on each ID's first row)
gap = pd.to_timedelta(df['DT']).dt.total_seconds().groupby(df['ID']).diff()

# start a new sequence wherever the gap is not exactly 1 second;
# the running count of those starts labels every row with its sequence
df['seq'] = gap.ne(1).cumsum()

# group by ID and seq and take the aggregates
# then drop the helper seq column

(df.groupby(['ID','seq']).agg(MENOR_DT=('DT','min'),
                              MAIOR_DT=('DT','max'),
                              SOMA=('VALOR','sum'))
 .reset_index()
 .drop(columns=['seq']))

    ID  MENOR_DT    MAIOR_DT    SOMA
0   1   12:01:00    12:01:03    15
1   1   12:01:08    12:01:11    16
2   2   12:01:09    12:01:09    6