Python - Calculate Time Between Visits (Using Start and Stop)-CodePudding

This may have been answered previously, but I'm not sure of the correct terms to search for, so apologies in advance if so.

I am connecting to a MSSQL database using pyodbc and pulling a list of visits. These visits have a start and end time:

select EMP_ID, VISIT_DATE, VISITSTART, VISITSTOP 
from VISITS 
where 
EMP_ID in (<List of Emps>) and SORTDATE between '2023-01-16' and '2023-01-29'
order by EMP_ID, VISIT_DATE

Returning:

EMP_ID      VISIT_DATE              VISITSTART              VISITSTOP
S0000000163 2023-01-16 00:00:00.000 2023-01-16 08:59:00.000 2023-01-16 10:32:00.000
S0000000163 2023-01-16 00:00:00.000 2023-01-16 10:58:00.000 2023-01-16 12:35:00.000
S0000000163 2023-01-16 00:00:00.000 2023-01-16 13:06:00.000 2023-01-16 14:06:00.000
S0000000166 2023-01-16 00:00:00.000 2023-01-16 07:21:00.000 2023-01-16 08:15:00.000
S0000000166 2023-01-16 00:00:00.000 2023-01-16 08:45:00.000 2023-01-16 08:52:00.000
S0000000166 2023-01-16 00:00:00.000 2023-01-16 09:56:00.000 2023-01-16 10:46:00.000

What I am trying to calculate is the difference, in minutes, from the end of one visit to the start of the next. This then needs to be grouped by EMP_ID, per day.

So in the example above:

The first visit ends at 1032, the next starts at 1058 - this is a gap of 26 minutes
The second visit ends at 1235, the next starts at 1306 - this is a gap of 31 minutes

And so on, moving on to the next day for the same employee, or starting again with the next employee. For employee 163 on the 16th Jan there is a total of 57 minutes between visits.

I know there I could do this using loops but I know there is a "cleaner" way to do this, using some kind of rolling window. I just lack the knowledge to search for the correct functions and packages.

The posts and blogs I have come across calculate the difference in times in a single column, but I have two.

Ultimately, what I would like is an output along the lines of:

EMP_ID      VISITDATE  IDLE_TIME
S0000000163 2023-01-16 56
S0000000163 2023-01-17 48
S0000000166 2023-01-16 92
S0000000166 2023-01-17 54

This will process approximately 150,000 individual visits per processing run.

Any assistance greatly appreciated!

Thank You

CodePudding user response：

Pandas library provides a rich functionality for such cases.
Assuming that you loaded the selected data from your MSSQL connection into a dataframe with:

import pandas as pd

...
df = pd.read_sql(your_query, your_pyodbc_connection)

Here is grouping by 'EMP_ID','VISIT_DATE' and calculating differences between adjacent visits within a group:

visit_gaps = df.groupby(['EMP_ID','VISIT_DATE']).apply(lambda x: x['VISITSTART'].shift(-1).sub(x['VISITSTOP'])
                                                       .astype('timedelta64[m]').sum()).to_frame('IDLE_TIME').reset_index()

        EMP_ID VISIT_DATE  IDLE_TIME
0  S0000000163 2023-01-16       57.0
1  S0000000166 2023-01-16       94.0