How can I group a pandas (or dask) DataFrame and get the min, the max, and some aggregate (here a sum), but only across rows whose timestamps are 1 second apart?
MY DATA:
ID | DT | VALOR |
---|---|---|
1 | 12:01:00 | 7 |
1 | 12:01:01 | 1 |
1 | 12:01:02 | 4 |
1 | 12:01:03 | 3 |
1 | 12:01:08 | 1 |
1 | 12:01:09 | 5 |
2 | 12:01:09 | 6 |
1 | 12:01:10 | 6 |
1 | 12:01:11 | 4 |
EXPECTED RESULT:
ID | MENOR_DT | MAIOR_DT | SOMA |
---|---|---|---|
1 | 12:01:00 | 12:01:03 | 15 |
1 | 12:01:08 | 12:01:11 | 16 |
2 | 12:01:09 | 12:01:09 | 6 |
CodePudding user response:
Try:
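import pandas as pd

# Parse the HH:MM:SS strings so that consecutive rows can be subtracted
df["DT"] = pd.to_timedelta(df["DT"])

# Label runs within each ID: the label increases whenever the gap
# to the previous row of that ID is not exactly 1 second
tmp = df.groupby("ID", group_keys=False)["DT"].apply(
    lambda x: (x.diff().bfill() != "1 second").cumsum()
)

# Aggregate each (ID, run) pair
df = (
    df.groupby(["ID", tmp])
    .agg(
        ID=("ID", "first"),
        MENOR_DT=("DT", "min"),
        MAIOR_DT=("DT", "max"),
        SOMA=("VALOR", "sum"),
    )
    .reset_index(drop=True)
)

# Turn the timedeltas back into HH:MM:SS strings for display
df["MENOR_DT"] = df["MENOR_DT"].astype(str).str.split().str[-1]
df["MAIOR_DT"] = df["MAIOR_DT"].astype(str).str.split().str[-1]
print(df)
Prints:
   ID  MENOR_DT  MAIOR_DT  SOMA
0   1  12:01:00  12:01:03    15
1   1  12:01:08  12:01:11    16
2   2  12:01:09  12:01:09     6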
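To see why the diff-then-cumsum trick works, it can help to look at the intermediate run labels. The sketch below rebuilds the question's sample data (the `df` construction is an assumption, since the question only shows a table) and attaches two illustrative columns, `GAP` and `RUN`, which are hypothetical names and not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question; building it this way is an assumption.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1, 2, 1, 1],
    "DT": ["12:01:00", "12:01:01", "12:01:02", "12:01:03", "12:01:08",
           "12:01:09", "12:01:09", "12:01:10", "12:01:11"],
    "VALOR": [7, 1, 4, 3, 1, 5, 6, 6, 4],
})
df["DT"] = pd.to_timedelta(df["DT"])

# Gap to the previous row of the same ID (NaT for the first row of each ID).
gap = df.groupby("ID")["DT"].diff()

# 1 where a row opens a new run (gap is missing or not exactly one second), else 0.
starts_run = (gap != pd.Timedelta(seconds=1)).astype(int)

# Counting run starts cumulatively within each ID gives a per-ID run label.
run = starts_run.groupby(df["ID"]).cumsum()

print(df.assign(GAP=gap, RUN=run))
```

Rows that share both `ID` and `RUN` are exactly the rows that end up in one output line after the `groupby(...).agg(...)` step.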
CodePudding user response:
import pandas as pd

# sort by ID and time so that each row follows its predecessor within the same ID
df = df.sort_values(['ID', 'DT'])
# seconds since the previous row of the same ID (NaN for the first row of each ID);
# taking the diff per ID keeps rows of other IDs from breaking up a run
gap = pd.to_timedelta(df['DT']).dt.total_seconds().groupby(df['ID']).diff()
# any gap other than exactly 1 second starts a new run; cumsum numbers the runs
df['seq'] = gap.ne(1).cumsum()
# group by ID and seq and take the aggregates,
# then drop the helper seq column
(df.groupby(['ID', 'seq']).agg(MENOR_DT=('DT', 'min'),
                               MAIOR_DT=('DT', 'max'),
                               SOMA=('VALOR', 'sum'))
   .reset_index()
   .drop(columns=['seq']))
Output:
   ID  MENOR_DT  MAIOR_DT  SOMA
0   1  12:01:00  12:01:03    15
1   1  12:01:08  12:01:11    16
2   2  12:01:09  12:01:09     6
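For reuse, the same idea can be wrapped in a small function. This is only a sketch: `summarize_runs` and its `step` parameter are hypothetical names, not pandas API, and it assumes `DT` holds zero-padded HH:MM:SS strings so that string `min`/`max` agree with time order. The sample frame is made up to show rows of different IDs interleaving, which is the case where the gap has to be computed per ID.

```python
import pandas as pd


def summarize_runs(df, step="1s"):
    """Hypothetical helper: per ID, aggregate runs of rows spaced exactly `step` apart."""
    out = df.sort_values(["ID", "DT"]).copy()
    seconds = pd.to_timedelta(out["DT"]).dt.total_seconds()
    # Gap in seconds to the previous row of the same ID (NaN for an ID's first row).
    gap = seconds.groupby(out["ID"]).diff()
    # Any gap other than `step` opens a new run; cumsum numbers the runs.
    out["seq"] = gap.ne(pd.Timedelta(step).total_seconds()).cumsum()
    return (
        out.groupby(["ID", "seq"])
        .agg(MENOR_DT=("DT", "min"), MAIOR_DT=("DT", "max"), SOMA=("VALOR", "sum"))
        .reset_index()
        .drop(columns="seq")
    )


# Made-up data: the ID 2 row sits between two consecutive ID 1 rows.
sample = pd.DataFrame({
    "ID": [1, 2, 1, 1],
    "DT": ["12:01:00", "12:01:30", "12:01:01", "12:01:05"],
    "VALOR": [1, 10, 2, 3],
})
result = summarize_runs(sample)
print(result)
```

Here the first two ID 1 rows still form one run even though an ID 2 row lies between them in the input, because the frame is sorted and the gap is taken within each ID.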