Home > Net >  Pandas GroupBy on consecutive blocks of rows
Pandas GroupBy on consecutive blocks of rows

Time:02-01

How do you groupby on consecutive blocks of rows where each block is separated by a threshold value?

I have the following sample pandas dataframe, and I'm having difficulty getting blocks of rows whose difference in their dates is greater than 365 days.

Date Data
2019-01-01 A
2019-05-01 B
2020-04-01 C
2021-07-01 D
2022-02-01 E
2024-05-01 F

The output I'm looking for is the following,

Min Date Max Date Data
2019-01-01 2020-04-01 ABC
2021-07-01 2022-02-01 DE
2024-05-01 2024-05-01 F

I was looking at pandas .diff() and .cumsum() for getting the number of days between two rows and filtering for rows with difference > 365 days, however, it doesn't work when the dataframe has multiple blocks of rows.

CodePudding user response:

I would also suggest .diff() and .cumsum():

import pandas as pd

df = pd.read_clipboard()
df["Date"] = pd.to_datetime(df["Date"])

blocks = df["Date"].diff().gt("365D").cumsum()
out = df.groupby(blocks).agg({"Date": ["min", "max"], "Data": "sum"})

out:

           Date            Data
            min        max  sum
Date
0    2019-01-01 2019-05-01   AB
1    2020-06-01 2020-06-01    C
2    2021-07-01 2022-02-01   DE
3    2024-05-01 2024-05-01    F

after which you can replace the column labels (now a 2 level MultiIndex) as appropriate.

The date belonging to data "C" is more than 365 days apart from both "B" and "D", so it got its own group. Or am I misunderstanding your expected output?

  • Related