Home > Software design >  Rolling difference of date column with groupby in pandas
Rolling difference of date column with groupby in pandas

Time:09-22

I have the following dataframe

import pandas as pd
from pandas import Timestamp
foo = pd.DataFrame.from_dict(data={'id': {0: '1',
  1: '1',
  2: '1',
  3: '2',
  4: '2'},
 'session': {0: 3, 1: 2, 2: 1, 3: 1, 4: 2},
 'start_time': {0: Timestamp('2021-09-02 19:49:19'),
  1: Timestamp('2021-09-16 10:54:21'),
  2: Timestamp('2021-07-12 17:11:54'),
  3: Timestamp('2021-03-02 01:53:22'),
  4: Timestamp('2021-01-09 11:38:35')}})

I would like to add a new column to foo, called diff_start_time, which would be the difference of the start_time column of the current session from the previous one, grouped by id. I would like the difference to be in hours.

How could I do that in python ?

CodePudding user response:

You can use .groupby() diff() dt.total_seconds() to get the total number of seconds in difference, then divide by 3600 to get the differences in hours.

df_out = foo.sort_values(['id', 'session'])
df_out['diff_start_time'] = df_out.groupby('id')['start_time'].diff().dt.total_seconds() / 3600

Result:

print(df_out)

  id  session          start_time  diff_start_time
2  1        1 2021-07-12 17:11:54              NaN
1  1        2 2021-09-16 10:54:21      1577.707500
0  1        3 2021-09-02 19:49:19      -327.083889
3  2        1 2021-03-02 01:53:22              NaN
4  2        2 2021-01-09 11:38:35     -1238.246389

CodePudding user response:

Use DataFrameGroupBy.diff with Series.dt.total_seconds:

foo['diff_start_time'] = foo.groupby('id')['start_time'].diff().dt.total_seconds().div(3600)
print (foo)

  id  session          start_time  diff_start_time
0  1        3 2021-09-02 19:49:19              NaN
1  1        2 2021-09-16 10:54:21       327.083889
2  1        1 2021-07-12 17:11:54     -1577.707500
3  2        1 2021-03-02 01:53:22              NaN
4  2        2 2021-01-09 11:38:35     -1238.246389

If necessary sorting first by id, session:

foo = foo.sort_values(['id','session'])
foo['diff_start_time'] = foo.groupby('id')['start_time'].diff().dt.total_seconds().div(3600)
print (foo)

  id  session          start_time  diff_start_time
2  1        1 2021-07-12 17:11:54              NaN
1  1        2 2021-09-16 10:54:21      1577.707500
0  1        3 2021-09-02 19:49:19      -327.083889
3  2        1 2021-03-02 01:53:22              NaN
4  2        2 2021-01-09 11:38:35     -1238.246389
  • Related