I have the following dataframe:
A B start_date end_date id
0 1 2 2022-01-01 2022-01-10 1
1 2 2 2022-02-02 2022-02-05 2
2 1 2 2022-01-11 2022-01-15 3
3 2 2 2022-02-06 2022-02-10 4
4 2 2 2022-02-11 2022-02-15 5
5 2 3 2022-01-14 2022-01-17 6
6 2 3 2022-01-19 2022-01-22 7
There are several records that follow one after the other. For example, rows 1 and 3. Row 3 has the same values A and B and starts the next day when row 1 ends. I want to compress this dataframe into the following form:
A B start_date end_date id
0 1 2 2022-01-01 2022-01-15 1
1 2 2 2022-02-02 2022-02-15 2
2 2 3 2022-01-14 2022-01-17 3
3 2 3 2022-01-19 2022-01-22 4
That is, I save one record where the difference between the start_date of the next record and the end_date of the previous one is 1 day. In this case, end_date is changed to end_date for the last record inside such a sequence.
CodePudding user response:
You can use a custom grouper to join the successive dates per group:
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(pd.to_datetime)
m = (df['start_date'].sub(df.groupby(['A', 'B'])
['end_date'].shift()
.add(pd.Timedelta('1d'))
).ne('0')
.groupby([df['A'], df['B']]).cumsum()
)
out = (df
.groupby(['A', 'B', m], as_index=False)
.agg({'start_date': 'first', 'end_date': 'last'})
.assign(id=lambda d: range(1, len(d) 1))
)
Output:
A B start_date end_date id
0 1 2 2022-01-01 2022-01-15 1
1 2 2 2022-02-02 2022-02-15 2
2 2 3 2022-01-14 2022-01-17 3
3 2 3 2022-01-19 2022-01-22 4