I have a pandas DataFrame like this:
start end value course
0 2022-01-01 2022-01-01 10 first
1 2022-01-02 2022-01-02 20 first
2 2022-01-05 2022-01-05 30 second
3 2022-01-04 2022-01-04 40 second
4 2022-01-08 2022-01-08 21 first
5 2022-01-09 2022-01-09 92 first
6 2022-01-10 2022-01-10 55 first
What's the best way to group it like this:
start end value course
0 2022-01-01 2022-01-02 10 first
1 2022-01-04 2022-01-05 30 second
2 2022-01-08 2022-01-10 21 first
There may be more rows for a particular course, but the idea is how to group first by the course, and second by one continuous date range. Or maybe it's worth trying to split on missing dates? The closest case is this one, but it didn't help, since I don't have the frequency of the dates to pass into pd.Grouper(), and I also need to keep the start column.
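For reference, a reproducible version of the frame above (values copied from the table; dates as strings) might be:

```python
import pandas as pd

# Rebuild the example frame shown above
df = pd.DataFrame(
    {
        "start": ["2022-01-01", "2022-01-02", "2022-01-05", "2022-01-04",
                  "2022-01-08", "2022-01-09", "2022-01-10"],
        "end": ["2022-01-01", "2022-01-02", "2022-01-05", "2022-01-04",
                "2022-01-08", "2022-01-09", "2022-01-10"],
        "value": [10, 20, 30, 40, 21, 92, 55],
        "course": ["first", "first", "second", "second", "first", "first", "first"],
    }
)
print(df)
```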
CodePudding user response:
A possible solution :
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])
aggs = {"start": "first", "end": "last", "value": "first"}
out = (
    df
    .sort_values(by=["course", "start"])
    # A new block starts where this row's start is not exactly
    # the previous row's end + 1 day (within the same course)
    .assign(prev_end=lambda x: x.groupby("course")["end"].shift(),
            group=lambda x: (x["start"] != x["prev_end"]
                             + pd.Timedelta(days=1)).cumsum())
    .groupby(["course", "group"], as_index=False).agg(aggs)
    .sort_values(by="start", ignore_index=True).drop(columns="group")
    [["start", "end", "value", "course"]]
)
Output :
print(out)
start end value course
0 2022-01-01 2022-01-02 10 first
1 2022-01-04 2022-01-05 40 second
2 2022-01-08 2022-01-10 21 first
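For intuition, the gap check used above can be sketched in isolation with shift (the dates here are illustrative, taken from the "first" course):

```python
import pandas as pd

start = pd.Series(pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-08"]))
end = pd.Series(pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-08"]))

# A block ends where the next start is not exactly end + 1 day
# (the last row has no successor, so it also ends a block)
next_start = start.shift(-1)
block_end = (next_start != end + pd.Timedelta(days=1)) | next_start.isna()
print(block_end.tolist())  # [False, True, True]
```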
CodePudding user response:
You can create virtual subgroup:
# Convert as DatetimeIndex if necessary
# df['start'] = pd.to_datetime(df['start'])
# df['end'] = pd.to_datetime(df['end'])
# True where the gap from the previous row's end is not exactly one day,
# i.e. where a new subgroup starts
is_consecutive = lambda x: x['start'].sub(x['end'].shift()).ne(pd.Timedelta('1D'))

df['group'] = (df.sort_values(['start', 'end'])
                 .groupby('course').apply(is_consecutive)
                 .cumsum().droplevel('course'))
print(df)
# Output
start end value course group
0 2022-01-01 2022-01-01 10 first 1
1 2022-01-02 2022-01-02 20 first 1
2 2022-01-05 2022-01-05 30 second 3
3 2022-01-04 2022-01-04 40 second 3
4 2022-01-08 2022-01-08 21 first 2
5 2022-01-09 2022-01-09 92 first 2
6 2022-01-10 2022-01-10 55 first 2
Now you can do what you want with these groups.
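For example, assuming the group labels printed above, each subgroup can be collapsed to one row (earliest start, latest end, and the first value in date order), matching the shape of the desired output:

```python
import pandas as pd

# Frame with the group labels from the output above (illustrative)
df = pd.DataFrame({
    "start": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-05",
                             "2022-01-04", "2022-01-08", "2022-01-09",
                             "2022-01-10"]),
    "end": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-05",
                           "2022-01-04", "2022-01-08", "2022-01-09",
                           "2022-01-10"]),
    "value": [10, 20, 30, 40, 21, 92, 55],
    "course": ["first", "first", "second", "second", "first", "first", "first"],
    "group": [1, 1, 3, 3, 2, 2, 2],
})

# Collapse each (course, group) pair into a single date range
out = (df.sort_values(["start", "end"])
         .groupby(["course", "group"], as_index=False)
         .agg({"start": "min", "end": "max", "value": "first"})
         .drop(columns="group")
         .sort_values("start", ignore_index=True)
         [["start", "end", "value", "course"]])
print(out)
```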