Home > Back-end >  Pandas groupby and shift spilling between groups
Pandas groupby and shift spilling between groups

Time:08-03

I'm having an issue with Pandas where the combination of groupby and shift seems to have data spilled between the groups.

Here's a reproducible example:

from pandas import Timestamp

sample = {'start': {0: Timestamp('2022-08-02 07:20:00'),
      1: Timestamp('2022-08-02 07:25:00'),
      2: Timestamp('2022-08-02 07:26:00'),
      3: Timestamp('2022-08-02 07:35:00'),
      4: Timestamp('2022-08-02 08:20:00'),
      5: Timestamp('2022-08-02 08:25:00'),
      6: Timestamp('2022-08-02 08:26:00'),
      7: Timestamp('2022-08-02 08:35:00')},
     'end': {0: Timestamp('2022-08-02 07:30:00'),
      1: Timestamp('2022-08-02 07:35:00'),
      2: Timestamp('2022-08-02 12:34:00'),
      3: Timestamp('2022-08-02 07:40:00'),
      4: Timestamp('2022-08-02 08:30:00'),
      5: Timestamp('2022-08-02 08:55:00'),
      6: Timestamp('2022-08-02 08:34:00'),
      7: Timestamp('2022-08-02 08:40:00')},
     'group': {0: 'G1',
      1: 'G1',
      2: 'G1',
      3: 'G1',
      4: 'G2',
      5: 'G2',
      6: 'G2',
      7: 'G2'}}

df = pd.DataFrame(sample)
df = df.sort_values('start')

df['notworking'] = df.groupby('group')['end'].shift().cummax()

This gives the following output

    start               end                 group   notworking
0   2022-08-02 07:20:00 2022-08-02 07:30:00 G1  
1   2022-08-02 07:25:00 2022-08-02 07:35:00 G1      2022-08-02 07:30:00
2   2022-08-02 07:26:00 2022-08-02 12:34:00 G1      2022-08-02 07:35:00
3   2022-08-02 07:35:00 2022-08-02 07:40:00 G1      2022-08-02 12:34:00
4   2022-08-02 08:20:00 2022-08-02 08:30:00 G2  
5   2022-08-02 08:25:00 2022-08-02 08:55:00 G2      2022-08-02 12:34:00
6   2022-08-02 08:26:00 2022-08-02 08:34:00 G2      2022-08-02 12:34:00
7   2022-08-02 08:35:00 2022-08-02 08:40:00 G2      2022-08-02 12:34:00

The 'end' at index 2 is correctly assigned to 'notworking' at index 3, but this value persists over in the next group.

My desired outcome is for cummax() to start fresh for each group, like this:

    start               end                 group   notworking
0   2022-08-02 07:20:00 2022-08-02 07:30:00 G1  
1   2022-08-02 07:25:00 2022-08-02 07:35:00 G1      2022-08-02 07:30:00
2   2022-08-02 07:26:00 2022-08-02 12:34:00 G1      2022-08-02 07:35:00
3   2022-08-02 07:35:00 2022-08-02 07:40:00 G1      2022-08-02 12:34:00
4   2022-08-02 08:20:00 2022-08-02 08:30:00 G2  
5   2022-08-02 08:25:00 2022-08-02 08:55:00 G2      2022-08-02 08:30:00
6   2022-08-02 08:26:00 2022-08-02 08:34:00 G2      2022-08-02 08:55:00
7   2022-08-02 08:35:00 2022-08-02 08:40:00 G2      2022-08-02 08:55:00

I guess this is simple user error. Does anyone know a fix for this?

CodePudding user response:

groupby.shift returns a Series so your cummax is operated on the Series not your desired SeriesGroupBy. You can try groupby.transform

df['notworking'] = df.groupby('group')['end'].transform(lambda col: col.shift().cummax())
print(df)

                start                 end group          notworking
0 2022-08-02 07:20:00 2022-08-02 07:30:00    G1                 NaT
1 2022-08-02 07:25:00 2022-08-02 07:35:00    G1 2022-08-02 07:30:00
2 2022-08-02 07:26:00 2022-08-02 12:34:00    G1 2022-08-02 07:35:00
3 2022-08-02 07:35:00 2022-08-02 07:40:00    G1 2022-08-02 12:34:00
4 2022-08-02 08:20:00 2022-08-02 08:30:00    G2                 NaT
5 2022-08-02 08:25:00 2022-08-02 08:55:00    G2 2022-08-02 08:30:00
6 2022-08-02 08:26:00 2022-08-02 08:34:00    G2 2022-08-02 08:55:00
7 2022-08-02 08:35:00 2022-08-02 08:40:00    G2 2022-08-02 08:55:00
  • Related