Is there a simple way to calculate the sum of the duration for each hour in the dataframe?

Time:06-07

I have a dataframe similar to this one:

user start_date end_date activity
1 2022-05-21 07:23:58 2022-05-21 14:23:48 Sleep
2 2022-05-21 12:59:16 2022-05-21 14:59:16 Eat
1 2022-05-21 18:59:16 2022-05-21 21:20:16 Work
3 2022-05-21 18:50:00 2022-05-21 21:20:16 Work

I want a dataframe in which the columns are the activities and each row contains the total duration of each activity, summed over all users, within that hour. Sorry, I find it hard to even put my thoughts into words. The expected result should be similar to this one:

hour sleep[s] eat[s] work[s]
00 0 0 0
01 0 0 0
... ... ... ...
07 2162 0 0
08 3600 0 0
... ... ... ...
18 0 0 644
... ... ... ...

The dataframe has more than 10 million rows, so I'm looking for something fast. I tried using crosstab to get the expected columns and resample to get the rows, but I'm not even close to finding a solution. I would be really grateful for any ideas on how to do that. This is my first question, so I apologize in advance for any mistakes :)

CodePudding user response:

Here's what I did with the 4 rows of data you posted.

Here's what I tried to do and my assumptions:

Goal = for each row, 
- calculate total elapsed time, 
- determine start hour sh, end hour eh, 
- work out how many seconds pertain to sh & eh

then:
-  create a 24-row df with results added up for all users (assuming the data starts at 00:00 and ends at 23:59 on the same day, there isn't any other day, and you're interested in the aggregate result for all users)
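To make the per-row arithmetic concrete, here is a minimal sketch for the first row of the question's data (the Sleep row). The variable names mirror the columns used further down; the hour boundaries are computed with replace() purely for illustration.

```python
import pandas as pd

# First row of the question's data (Sleep)
start = pd.Timestamp("2022-05-21 07:23:58")
end = pd.Timestamp("2022-05-21 14:23:48")

# total elapsed time, in seconds
total = (end - start).total_seconds()

# hour boundaries: 07:23:58 => 08:00:00 and 14:23:48 => 14:00:00
next_hour = start.replace(minute=0, second=0, microsecond=0) + pd.Timedelta(hours=1)
previous_hour = end.replace(minute=0, second=0, microsecond=0)

# seconds that belong to the start hour and to the end hour
sh_s = (next_hour - start).total_seconds()
eh_s = (end - previous_hour).total_seconds()

# whatever is left is made of full hours (08:00 to 14:00 here)
remaining_seconds = total - sh_s - eh_s
full_hours = remaining_seconds / 3600

print(total, sh_s, eh_s, remaining_seconds, full_hours)
```

So hour 07 gets 2162 seconds (the value in the question's expected table), hour 14 gets 1428 seconds, and each of the six hours in between gets 3600.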

I'm using a For loop to build the final dataframe. It's probably not efficient enough if you have 10m rows in your original dataset, but it gets the job done on a smaller subset. Hopefully you can improve upon it!

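import pandas as pd
import numpy as np

# sample data: the 4 rows posted in the question
df = pd.DataFrame({
    'user': [1, 2, 1, 3],
    'start_date': pd.to_datetime(['2022-05-21 07:23:58', '2022-05-21 12:59:16',
                                  '2022-05-21 18:59:16', '2022-05-21 18:50:00']),
    'end_date': pd.to_datetime(['2022-05-21 14:23:48', '2022-05-21 14:59:16',
                                '2022-05-21 21:20:16', '2022-05-21 21:20:16']),
    'activity': ['Sleep', 'Eat', 'Work', 'Work'],
})
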

# work out total elapsed time, in seconds
df['total'] = df['end_date'] - df['start_date']
df['total'] = df['total'].dt.total_seconds()

# determine start/end hours (just the value: 07:00 AM => 7)
df['sh'] = df['start_date'].dt.hour
df['eh'] = df['end_date'].dt.hour

# round the start up and the end down to the nearest hour boundary
df['next_hour'] = df['start_date'].dt.ceil('H') # 07:23:58 => 08:00:00
df['previous_hour'] = df['end_date'].dt.floor('H') # 14:23:48 => 14:00:00
df

# determine how many seconds pertain to start/end hours
df['sh_s'] = df['next_hour'] - df['start_date']
df['sh_s'] = df['sh_s'].dt.total_seconds()

df['eh_s'] = df['end_date'] - df['previous_hour']
df['eh_s'] = df['eh_s'].dt.total_seconds()

df['remaining_seconds'] = df['total'] - df['sh_s'] - df['eh_s']

df

# add all column hours to df
hour_columns = np.arange(24)
df[hour_columns] = 0.0  # start every hour at 0 seconds so the columns stay numeric and can be summed
df

# Sorry for this horrible loop below... Hopefully you can improve upon it or get rid of it completely!
for i in range(len(df)): # for each row:
    
    sh = df.iloc[i, df.columns.get_loc('sh')] # get start hour value
    sh_s = df.iloc[i, df.columns.get_loc('sh_s')] #get seconds pertaining to start hour
    
    eh = df.iloc[i, df.columns.get_loc('eh')] # get end hour value
    eh_s = df.iloc[i, df.columns.get_loc('eh_s')] #get seconds pertaining to end hour
    
    # fill in hour columns
    df.iloc[i, df.columns.get_loc(sh)] = sh_s
    df.iloc[i, df.columns.get_loc(eh)] = eh_s
    
    # for each col between sh & eh, input 3600
    for j in range(sh + 1, eh):
        df.iloc[i, df.columns.get_loc(j)] = 3600

df.groupby('activity', as_index=False)[hour_columns].sum().transpose()

Result:

[image not preserved]
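Since the question mentions more than 10 million rows, here is one possible way to get rid of the loop. This is only a sketch, not a tested solution for the full dataset: it assumes tz-naive timestamps, end_date >= start_date, and that you want the hour-of-day totals for all users (as in the assumptions above). The names one_hour, first_bin, n_bins, row, offset and pieces are mine. The idea is to repeat each row once per hour it touches with NumPy and then clip every piece to its hour.

```python
import numpy as np
import pandas as pd

# sample data: the 4 rows posted in the question
df = pd.DataFrame({
    "user": [1, 2, 1, 3],
    "start_date": pd.to_datetime(["2022-05-21 07:23:58", "2022-05-21 12:59:16",
                                  "2022-05-21 18:59:16", "2022-05-21 18:50:00"]),
    "end_date": pd.to_datetime(["2022-05-21 14:23:48", "2022-05-21 14:59:16",
                                "2022-05-21 21:20:16", "2022-05-21 21:20:16"]),
    "activity": ["Sleep", "Eat", "Work", "Work"],
})

one_hour = np.timedelta64(1, "h")
start = df["start_date"].to_numpy()
end = df["end_date"].to_numpy()

# hour boundary at or before each start (07:23:58 => 07:00)
first_bin = start.astype("datetime64[h]")

# number of hourly bins each row touches
n_bins = np.ceil((end - first_bin) / one_hour).astype(int)

# repeat each row index once per bin and number the bins 0, 1, 2, ... within a row
row = np.repeat(np.arange(len(df)), n_bins)
offset = np.arange(n_bins.sum()) - np.repeat(np.cumsum(n_bins) - n_bins, n_bins)

# start of every hourly bin, then clip the activity interval to that bin
bin_start = first_bin[row] + offset * one_hour
seg_start = np.maximum(bin_start, start[row])
seg_end = np.minimum(bin_start + one_hour, end[row])
seconds = (seg_end - seg_start) / np.timedelta64(1, "s")

# hour of day (0-23) of every bin
hour_of_day = ((bin_start - bin_start.astype("datetime64[D]")) / one_hour).astype(int)

pieces = pd.DataFrame({
    "hour": hour_of_day,
    "activity": df["activity"].to_numpy()[row],
    "seconds": seconds,
})

# one row per hour of day, one column per activity
result = (pieces.groupby(["hour", "activity"])["seconds"].sum()
          .unstack(fill_value=0)
          .reindex(range(24), fill_value=0))
print(result)
```

On the 4 sample rows this gives 2162 seconds of Sleep for hour 7 and 644 seconds of Work for hour 18, the same numbers as in the question's expected table. The intermediate arrays grow with the total number of hours covered by all rows, so with 10m rows you may need to process the data in chunks and add the partial results together.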

CodePudding user response:

I am assuming the activities' start times can span more than one day and that you actually want the summary for each day. If not, the answer could be adapted by using datetime.hour instead of binning the start time.

The answer:

  1. Calculate the duration of each activity in seconds
  2. Bin/cut the start time by hour
  3. pivot the dataframe by activity

Code:

import pandas as pd
import numpy as np
import math
from datetime import datetime


data={'users':[1,2,1,3],
      'start':['2022-05-21 07:23:58', '2022-05-21 12:59:16', '2022-05-21 18:59:16', '2022-05-21 18:50:00'],
     'end':[ '2022-05-21 14:23:48', '2022-05-21 14:59:16', '2022-05-21 21:20:16', '2022-05-21 21:20:16'],
'activity':[ 'Sleep', 'Eat', 'Work', 'Work']
}

df=pd.DataFrame(data)

# convert to datetime
df.start=pd.to_datetime(df.start)
df.end=pd.to_datetime(df.end)

# *** The answer starts here ***

#find the duration of each activity in seconds
df['duration']=(df.end -df.start)/pd.Timedelta(seconds=1)

# bin the start time by hour

# find bin range to generate the bin datetime range
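start_min = df.start.min()
binstart = datetime(start_min.year, start_min.month, start_min.day, start_min.hour)
# number of hourly bin edges needed to cover every start time (+ 1 for the closing edge)
period = math.ceil((df.start.max() - start_min) / pd.Timedelta(hours=1)) + 1
bins = pd.date_range(binstart, periods=period, freq="H")

# bin start time
df['start'] = pd.cut(df.start, bins=bins)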

# pivot the table
pd.pivot_table(df,index=['start'],columns=['activity'], values=['duration'],aggfunc='sum')

Result:

                                                     duration                  
activity                                        Eat    Sleep     Work
start                                                                
(2022-05-21 07:00:00, 2022-05-21 08:00:00]      0.0  25190.0      0.0
(2022-05-21 08:00:00, 2022-05-21 09:00:00]      0.0      0.0      0.0
(2022-05-21 09:00:00, 2022-05-21 10:00:00]      0.0      0.0      0.0
(2022-05-21 10:00:00, 2022-05-21 11:00:00]      0.0      0.0      0.0
(2022-05-21 11:00:00, 2022-05-21 12:00:00]      0.0      0.0      0.0
(2022-05-21 12:00:00, 2022-05-21 13:00:00]   7200.0      0.0      0.0
(2022-05-21 13:00:00, 2022-05-21 14:00:00]      0.0      0.0      0.0
(2022-05-21 14:00:00, 2022-05-21 15:00:00]      0.0      0.0      0.0
(2022-05-21 15:00:00, 2022-05-21 16:00:00]      0.0      0.0      0.0
(2022-05-21 16:00:00, 2022-05-21 17:00:00]      0.0      0.0      0.0
(2022-05-21 17:00:00, 2022-05-21 18:00:00]      0.0      0.0      0.0
(2022-05-21 18:00:00, 2022-05-21 19:00:00]      0.0      0.0  17476.0
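For the single-day case mentioned at the top of this answer, the adaptation could look roughly like the sketch below. It replaces the binning step with the hour of the start time (the hour column name and the by_hour variable are just for illustration). Like the code above, it puts the whole duration of an activity into the hour in which the activity starts.

```python
import pandas as pd

data = {'users': [1, 2, 1, 3],
        'start': ['2022-05-21 07:23:58', '2022-05-21 12:59:16', '2022-05-21 18:59:16', '2022-05-21 18:50:00'],
        'end': ['2022-05-21 14:23:48', '2022-05-21 14:59:16', '2022-05-21 21:20:16', '2022-05-21 21:20:16'],
        'activity': ['Sleep', 'Eat', 'Work', 'Work']}

df = pd.DataFrame(data)
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])

# duration of each activity in seconds
df['duration'] = (df['end'] - df['start']) / pd.Timedelta(seconds=1)

# hour of day (0-23) of the start time, used instead of pd.cut
df['hour'] = df['start'].dt.hour

# pivot: one row per start hour, one column per activity
by_hour = pd.pivot_table(df, index='hour', columns='activity',
                         values='duration', aggfunc='sum', fill_value=0)
print(by_hour)
```

Only hours in which at least one activity starts show up as rows; by_hour.reindex(range(24), fill_value=0) would pad it to all 24 hours.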