Python: calculation using data of dataframe efficiently-CodePudding

I have a dataframe with data of customers entering and leaving a building, see example here:

import pandas as pd
from datetime import datetime, timedelta

data = {'customer':  ['nameA', 'nameA', 'nameB', 'nameC', 'nameB', 'nameB', 'nameD', 'nameC', 'nameD', 'nameB'],
        'event': ['in', 'out', 'in', 'in', 'out', 'in', 'in', 'out', 'out', 'out',],
        'time_stamp': ['2020-04-20 09:58:47', '2020-04-20 19:58:52', '2020-04-20 09:28:52',
                       '2020-04-20 10:18:52', '2020-04-20 09:58:52', '2020-04-20 11:08:52',
                       '2020-04-20 13:58:52', '2020-04-20 14:58:52', '2020-04-20 15:58:52',
                       '2020-04-20 19:58:52']}

Is there a good way to calculate the time spent in the building for each person?

The problems I have is that some people enter the buildings more often, which makes it more difficult to handle. So far I split the dataframe into two separate ones 'in' and 'out' based on the corresponding event, delete all duplicates (so people who enter the building multiple times are ignored completely), sort both dataframes by customer and calculate the time difference between the dataframes 'in' and 'out'.

Is there a better way to do it? I was thinking iterating through the dataframes, but I think you should avoid it for time reasons if you have large data sets...

Thanks for your ideas!

CodePudding user response：

You could reshape, calculate timedeltas and sum them.

NB. this approach relies on the fact that a customer has both in and out values for each occurrence of a visit and that the dates are sorted (if not add a sorting step)

df = pd.DataFrame(data)

df['time_stamp'] = pd.to_datetime(df['time_stamp'])

(df
 # identify the occurrence of in/out
 .assign(group=df.groupby(['customer', 'event']).cumcount())
 # reshape to have in/out columns
 .pivot(index=['customer', 'group'], columns='event', values='time_stamp')
 # compute the duration
 .assign(duration=lambda d: d['out']-d['in'])
 # sum the durations per customer
 .groupby('customer')['duration'].sum()
)

output:

customer
nameA   0 days 10:00:05
nameB   0 days 09:20:00
nameC   0 days 04:40:00
nameD   0 days 02:00:00
Name: duration, dtype: timedelta64[ns]