How do you groupby on consecutive blocks of rows where each block is separated by a threshold value?
I have the following sample pandas dataframe, and I'm having difficulty getting blocks of rows whose difference in their dates is greater than 365 days.
Date | Data |
---|---|
2019-01-01 | A |
2019-05-01 | B |
2020-04-01 | C |
2021-07-01 | D |
2022-02-01 | E |
2024-05-01 | F |
The output I'm looking for is the following,
Min Date | Max Date | Data |
---|---|---|
2019-01-01 | 2020-04-01 | ABC |
2021-07-01 | 2022-02-01 | DE |
2024-05-01 | 2024-05-01 | F |
I was looking at pandas .diff()
and .cumsum()
for getting the number of days between two rows and filtering for rows with difference > 365 days, however, it doesn't work when the dataframe has multiple blocks of rows.
CodePudding user response:
I would also suggest .diff()
and .cumsum()
:
import pandas as pd
df = pd.read_clipboard()
df["Date"] = pd.to_datetime(df["Date"])
blocks = df["Date"].diff().gt("365D").cumsum()
out = df.groupby(blocks).agg({"Date": ["min", "max"], "Data": "sum"})
out:
Date Data
min max sum
Date
0 2019-01-01 2019-05-01 AB
1 2020-06-01 2020-06-01 C
2 2021-07-01 2022-02-01 DE
3 2024-05-01 2024-05-01 F
after which you can replace the column labels (now a 2 level MultiIndex) as appropriate.
The date belonging to data "C" is more than 365 days apart from both "B" and "D", so it got its own group. Or am I misunderstanding your expected output?