I have a dataset which records 'time-from' and 'time-to' for each user as time of day in a 24 hour format. Example:
User time_from time_to Activity
a 6:00 8:00 2
a 8:00 8:30 1
a 8:30 10:00 1
a 10:00 13:00 0
b 6:00 8:00 2
b 8:00 8:30 0
b 8:30 10:00 2
b 10:00 13:00 1
.
.
.
I want to get the number of hours spent on each activity for these users on an activity. I figured my intermediate step is to get:
User time_from time_to Activity time_spent
a 6:00 8:00 2 2:00
a 8:00 8:30 1 0:30
a 8:30 10:00 1 1:30
a 10:00 13:00 0 3:00
b 6:00 8:00 2 2:00
b 8:00 8:30 0 0:30
b 8:30 10:00 2 1:30
b 10:00 13:00 1 3:00
.
.
.
Which I try to get using:
df['time_from'] = pd.to_datetime(df['time_from'], format='%H:%M')
df['time_to'] = pd.to_datetime(df['time_to'], format='%H:%M')
df['time_spent'] = df['time_to'] - df['time_from']
However it gives me an output like:
User time_from time_to Activity time_spent
a 1900-01-01 6:00 1900-01-01 8:00 2 0 days 2:00
a 1900-01-01 8:00 1900-01-01 8:30 1 0 days 0:30
a 1900-01-01 8:30 1900-01-01 10:00 1 0 days 1:30
a 1900-01-01 10:00 1900-01-01 13:00 0 0 days 3:00
b 1900-01-01 6:00 1900-01-01 8:00 2 0 days 2:00
b 1900-01-01 8:00 1900-01-01 8:30 0 0 days 0:30
b 1900-01-01 8:30 1900-01-01 10:00 2 0 days 1:30
b 1900-01-01 10:00 1900-01-01 13:00 1 0 days 3:00
.
.
.
Can somebody help me get the desired result? I might be able to group_by on activities and them sum time_spent to get my ultimate result, but the output I am getting right now probably won't work.
CodePudding user response:
Because you convert the time to datetime which gives you a Timestamp with structure ('1900-01-01 06:00:00'). I suggest to substract it directly.
CodePudding user response:
You can use the groupby
on Users
and Activity
and immediately follow by agg
(sum) to total the time_spent
of a user on a given activity.
df['time_from'] = pd.to_datetime(df['time_from'], format='%H:%M')
df['time_to'] = pd.to_datetime(df['time_to'], format='%H:%M')
df['time_spent'] = df['time_to'] - df['time_from']
new_df = df.groupby(['User','Activity']).agg(sum)
Output from new_df
User Activity
a 0 0 days 03:00:00
1 0 days 02:00:00
2 0 days 02:00:00
b 0 0 days 00:30:00
1 0 days 03:00:00
2 0 days 03:30:00