I have the following dataframe
import pandas as pd
from pandas import Timestamp
foo = pd.DataFrame.from_dict(data={'id': {0: '1',
1: '1',
2: '1',
3: '2',
4: '2'},
'session': {0: 3, 1: 2, 2: 1, 3: 1, 4: 2},
'start_time': {0: Timestamp('2021-09-02 19:49:19'),
1: Timestamp('2021-09-16 10:54:21'),
2: Timestamp('2021-07-12 17:11:54'),
3: Timestamp('2021-03-02 01:53:22'),
4: Timestamp('2021-01-09 11:38:35')}})
I would like to add a new column to foo
, called diff_start_time
, which would be the difference of the start_time
column of the current session
from the previous one, grouped by id
. I would like the difference to be in hours
.
How could I do that in python ?
CodePudding user response:
You can use .groupby()
diff()
dt.total_seconds()
to get the total number of seconds in difference, then divide by 3600 to get the differences in hours.
df_out = foo.sort_values(['id', 'session'])
df_out['diff_start_time'] = df_out.groupby('id')['start_time'].diff().dt.total_seconds() / 3600
Result:
print(df_out)
id session start_time diff_start_time
2 1 1 2021-07-12 17:11:54 NaN
1 1 2 2021-09-16 10:54:21 1577.707500
0 1 3 2021-09-02 19:49:19 -327.083889
3 2 1 2021-03-02 01:53:22 NaN
4 2 2 2021-01-09 11:38:35 -1238.246389
CodePudding user response:
Use DataFrameGroupBy.diff
with Series.dt.total_seconds
:
foo['diff_start_time'] = foo.groupby('id')['start_time'].diff().dt.total_seconds().div(3600)
print (foo)
id session start_time diff_start_time
0 1 3 2021-09-02 19:49:19 NaN
1 1 2 2021-09-16 10:54:21 327.083889
2 1 1 2021-07-12 17:11:54 -1577.707500
3 2 1 2021-03-02 01:53:22 NaN
4 2 2 2021-01-09 11:38:35 -1238.246389
If necessary sorting first by id, session
:
foo = foo.sort_values(['id','session'])
foo['diff_start_time'] = foo.groupby('id')['start_time'].diff().dt.total_seconds().div(3600)
print (foo)
id session start_time diff_start_time
2 1 1 2021-07-12 17:11:54 NaN
1 1 2 2021-09-16 10:54:21 1577.707500
0 1 3 2021-09-02 19:49:19 -327.083889
3 2 1 2021-03-02 01:53:22 NaN
4 2 2 2021-01-09 11:38:35 -1238.246389