I'm having an issue with Pandas where the combination of groupby and shift seems to have data spilled between the groups.
Here's a reproducible example:
from pandas import Timestamp
sample = {'start': {0: Timestamp('2022-08-02 07:20:00'),
1: Timestamp('2022-08-02 07:25:00'),
2: Timestamp('2022-08-02 07:26:00'),
3: Timestamp('2022-08-02 07:35:00'),
4: Timestamp('2022-08-02 08:20:00'),
5: Timestamp('2022-08-02 08:25:00'),
6: Timestamp('2022-08-02 08:26:00'),
7: Timestamp('2022-08-02 08:35:00')},
'end': {0: Timestamp('2022-08-02 07:30:00'),
1: Timestamp('2022-08-02 07:35:00'),
2: Timestamp('2022-08-02 12:34:00'),
3: Timestamp('2022-08-02 07:40:00'),
4: Timestamp('2022-08-02 08:30:00'),
5: Timestamp('2022-08-02 08:55:00'),
6: Timestamp('2022-08-02 08:34:00'),
7: Timestamp('2022-08-02 08:40:00')},
'group': {0: 'G1',
1: 'G1',
2: 'G1',
3: 'G1',
4: 'G2',
5: 'G2',
6: 'G2',
7: 'G2'}}
df = pd.DataFrame(sample)
df = df.sort_values('start')
df['notworking'] = df.groupby('group')['end'].shift().cummax()
This gives the following output
start end group notworking
0 2022-08-02 07:20:00 2022-08-02 07:30:00 G1
1 2022-08-02 07:25:00 2022-08-02 07:35:00 G1 2022-08-02 07:30:00
2 2022-08-02 07:26:00 2022-08-02 12:34:00 G1 2022-08-02 07:35:00
3 2022-08-02 07:35:00 2022-08-02 07:40:00 G1 2022-08-02 12:34:00
4 2022-08-02 08:20:00 2022-08-02 08:30:00 G2
5 2022-08-02 08:25:00 2022-08-02 08:55:00 G2 2022-08-02 12:34:00
6 2022-08-02 08:26:00 2022-08-02 08:34:00 G2 2022-08-02 12:34:00
7 2022-08-02 08:35:00 2022-08-02 08:40:00 G2 2022-08-02 12:34:00
The 'end'
at index 2 is correctly assigned to 'notworking'
at index 3, but this value persists over in the next group.
My desired outcome is for cummax() to start fresh for each group, like this:
start end group notworking
0 2022-08-02 07:20:00 2022-08-02 07:30:00 G1
1 2022-08-02 07:25:00 2022-08-02 07:35:00 G1 2022-08-02 07:30:00
2 2022-08-02 07:26:00 2022-08-02 12:34:00 G1 2022-08-02 07:35:00
3 2022-08-02 07:35:00 2022-08-02 07:40:00 G1 2022-08-02 12:34:00
4 2022-08-02 08:20:00 2022-08-02 08:30:00 G2
5 2022-08-02 08:25:00 2022-08-02 08:55:00 G2 2022-08-02 08:30:00
6 2022-08-02 08:26:00 2022-08-02 08:34:00 G2 2022-08-02 08:55:00
7 2022-08-02 08:35:00 2022-08-02 08:40:00 G2 2022-08-02 08:55:00
I guess this is simple user error. Does anyone know a fix for this?
CodePudding user response:
groupby.shift
returns a Series so your cummax
is operated on the Series not your desired SeriesGroupBy. You can try groupby.transform
df['notworking'] = df.groupby('group')['end'].transform(lambda col: col.shift().cummax())
print(df)
start end group notworking
0 2022-08-02 07:20:00 2022-08-02 07:30:00 G1 NaT
1 2022-08-02 07:25:00 2022-08-02 07:35:00 G1 2022-08-02 07:30:00
2 2022-08-02 07:26:00 2022-08-02 12:34:00 G1 2022-08-02 07:35:00
3 2022-08-02 07:35:00 2022-08-02 07:40:00 G1 2022-08-02 12:34:00
4 2022-08-02 08:20:00 2022-08-02 08:30:00 G2 NaT
5 2022-08-02 08:25:00 2022-08-02 08:55:00 G2 2022-08-02 08:30:00
6 2022-08-02 08:26:00 2022-08-02 08:34:00 G2 2022-08-02 08:55:00
7 2022-08-02 08:35:00 2022-08-02 08:40:00 G2 2022-08-02 08:55:00