Pandas - How to group sequences


How can I group a pandas (or dask) dataframe and get the min, max and an aggregate (here a sum), but only over consecutive rows whose timestamps differ by exactly one second?

MY DATA:

ID  DT        VALOR
1   12:01:00  7
1   12:01:01  1
1   12:01:02  4
1   12:01:03  3
1   12:01:08  1
1   12:01:09  5
2   12:01:09  6
1   12:01:10  6
1   12:01:11  4

RETURN:

ID  MENOR_DT  MAIOR_DT  SOMA
1   12:01:00  12:01:03  15
1   12:01:08  12:01:11  16
2   12:01:09  12:01:09  6
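
For testing, the sample data above can be rebuilt as a small DataFrame (a sketch for reproduction only; the column names come from the question):

import pandas as pd

# Sample data from the question, in the original row order.
df = pd.DataFrame({
    "ID":    [1, 1, 1, 1, 1, 1, 2, 1, 1],
    "DT":    ["12:01:00", "12:01:01", "12:01:02", "12:01:03",
              "12:01:08", "12:01:09", "12:01:09", "12:01:10", "12:01:11"],
    "VALOR": [7, 1, 4, 3, 1, 5, 6, 6, 4],
})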

CodePudding user response:

Try:

df["DT"] = pd.to_timedelta(df["DT"])

tmp = df.groupby("ID", group_keys=False)["DT"].apply(
    lambda x: (x.diff().bfill() != "1 second").cumsum()
)

df = (
    df.groupby(["ID", tmp])
    .agg(
        ID=("ID", "first"),
        MENOR_DT=("DT", "min"),
        MAIOR_DT=("DT", "max"),
        SOME=("VALOR", "sum"),
    )
    .reset_index(drop=True)
)
df["MENOR_DT"] = df["MENOR_DT"].astype(str).str.split().str[-1]
df["MAIOR_DT"] = df["MAIOR_DT"].astype(str).str.split().str[-1]
print(df)

Prints:

   ID  MENOR_DT  MAIOR_DT  SOMA
0   1  12:01:00  12:01:03    15
1   1  12:01:08  12:01:11    16
2   2  12:01:09  12:01:09     6
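
The heart of this answer is tmp: within each ID, the comparison flags every row whose gap to the previous row is not exactly one second, and cumsum() turns those flags into a run counter, so rows of the same 1-second run share a value. To see it, print the counter right after it is built, before df is overwritten by the aggregation:

print(tmp.tolist())
# [0, 0, 0, 0, 1, 1, 1, 1, 1] for the sample data above:
# ID 1 splits into runs 0 and 1, and the lone ID 2 row gets its own (ID, run) group.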

CodePudding user response:

import numpy as np
import pandas as pd

df['seq'] = np.nan  # create a temp column

# sort the DF, mark a 1 wherever the gap to the previous row is not exactly
# one second (i.e. a new run starts); cumsum then numbers the runs and
# ffill spreads each run's number over its rows

df['seq'] = (df.sort_values(['ID', 'DT'])
             .assign(seq=df['seq']
                     .mask(pd.to_timedelta(df['DT']).dt.total_seconds()
                           .diff().ne(1), 1))['seq']
             .cumsum()
             .ffill()
)

# groupby ID and seq, take the aggregates, then drop the helper seq column

(df.groupby(['ID', 'seq']).agg(MENOR_DT=('DT', 'min'),
                               MAIOR_DT=('DT', 'max'),
                               SOMA=('VALOR', 'sum'))
 .reset_index()
 .drop(columns=['seq']))

   ID  MENOR_DT  MAIOR_DT  SOMA
0   1  12:01:00  12:01:03    15
1   1  12:01:08  12:01:11    16
2   2  12:01:09  12:01:09     6
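
Both answers rely on the same run-numbering idea. If the input may arrive unsorted or with IDs interleaved, a slightly more defensive variant (a sketch, not taken from either answer, assuming df still holds the original sample frame with DT as a string column) is to sort first and start a new run whenever either the ID changes or the gap is not exactly one second:

import pandas as pd

df = df.sort_values(["ID", "DT"]).reset_index(drop=True)
dt = pd.to_timedelta(df["DT"])
# New run when the ID changes or the gap to the previous row is not 1 second.
new_run = df["ID"].ne(df["ID"].shift()) | dt.diff().ne(pd.Timedelta("1 second"))
df["seq"] = new_run.cumsum()

out = (df.groupby(["ID", "seq"], as_index=False)
         .agg(MENOR_DT=("DT", "min"),
              MAIOR_DT=("DT", "max"),
              SOMA=("VALOR", "sum"))
         .drop(columns="seq"))
print(out)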