Merge records that follow one another within group-CodePudding

I have the following dataframe:

   A  B start_date   end_date  id
0  1  2 2022-01-01 2022-01-10   1
1  2  2 2022-02-02 2022-02-05   2
2  1  2 2022-01-11 2022-01-15   3
3  2  2 2022-02-06 2022-02-10   4
4  2  2 2022-02-11 2022-02-15   5
5  2  3 2022-01-14 2022-01-17   6
6  2  3 2022-01-19 2022-01-22   7

There are several records that follow one after the other. For example, rows 1 and 3. Row 3 has the same values A and B and starts the next day when row 1 ends. I want to compress this dataframe into the following form:

   A  B start_date   end_date  id
0  1  2 2022-01-01 2022-01-15   1
1  2  2 2022-02-02 2022-02-15   2
2  2  3 2022-01-14 2022-01-17   3
3  2  3 2022-01-19 2022-01-22   4

That is, I save one record where the difference between the start_date of the next record and the end_date of the previous one is 1 day. In this case, end_date is changed to end_date for the last record inside such a sequence.

CodePudding user response：

You can use a custom grouper to join the successive dates per group:

df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(pd.to_datetime)

m = (df['start_date'].sub(df.groupby(['A', 'B'])
                        ['end_date'].shift()
                          .add(pd.Timedelta('1d'))
                        ).ne('0')
      .groupby([df['A'], df['B']]).cumsum()
    )

out = (df
 .groupby(['A', 'B', m], as_index=False)
 .agg({'start_date': 'first', 'end_date': 'last'})
 .assign(id=lambda d: range(1, len(d) 1))
)

Output:

   A  B start_date   end_date  id
0  1  2 2022-01-01 2022-01-15   1
1  2  2 2022-02-02 2022-02-15   2
2  2  3 2022-01-14 2022-01-17   3
3  2  3 2022-01-19 2022-01-22   4