Pandas group by continuous date range-CodePudding

I have a pandas dataframe like this: start end value course

    start     end        value  course
0  2022-01-01 2022-01-01 10     first
1  2022-01-02 2022-01-02 20     first
2  2022-01-05 2022-01-05 30     second
3  2022-01-04 2022-01-04 40     second
4  2022-01-08 2022-01-08 21     first
5  2022-01-09 2022-01-09 92     first
6  2022-01-10 2022-01-10 55     first

What's the best way to group it like this:

    start           end         value       course
0   2022-01-01  2022-01-02  10      first
1   2022-01-04  2022-01-05  30      second
2   2022-01-08  2022-01-10  21      first

There might be more rows with particular course, but the idea is how to group first by the course, and second by one continuous date range? Or maybe it's worth to try to slipt by missing date? The closest case is this one, however it didn't help, since I dont have info about frequency of dates to pass into pd.Grouper(), and I also need to keep start column.

CodePudding user response：

A possible solution :

df["start"] = pd.to_datetime(df["start"])
df["end"]   = pd.to_datetime(df["end"])

aggs = {"start": "first", "end": "last", "value": "first"}

out = (
        df
         .sort_values(by=["course", "start"])
         .assign(next_ = lambda x: x.groupby('course')["start"].shift(-1),
                 group = lambda x: ((x["next_"] != x["end"]   pd.Timedelta(days=1))
                                   |(x["next_"].isna())).cumsum())
         .groupby(["course", "group"], as_index=False).agg(aggs)
         .sort_values(by="start", ignore_index=True).drop(columns="group")
          [["start",  "end", "value", "course"]]
      )

Output :

print(out)

       start        end  value  course
0 2022-01-01 2022-01-02     10   first
1 2022-01-04 2022-01-05     30  second
2 2022-01-08 2022-01-10     21   first

CodePudding user response：

You can create virtual subgroup:

# Convert as DatetimeIndex if necessary
# df['start'] = pd.to_datetime(df['start'])
# df['end'] = pd.to_datetime(df['end'])


is_consecutive = lambda x: x['start'].sub(x['end'].shift()).ne('1D')
df['group'] = (df.sort_values(['start', 'end'])
                 .groupby('course').apply(is_consecutive)
                 .cumsum().droplevel('course'))
print(df)

# Output
       start        end  value  course  group
0 2022-01-01 2022-01-01     10   first      1
1 2022-01-02 2022-01-02     20   first      1
2 2022-01-05 2022-01-05     30  second      3
3 2022-01-04 2022-01-04     40  second      3
4 2022-01-08 2022-01-08     21   first      2
5 2022-01-09 2022-01-09     92   first      2
6 2022-01-10 2022-01-10     55   first      2

Now you can do what you want with these groups.