How to get employee count by hour and date using PySpark / Python?


I have employee IDs with their clock-in and clock-out times by day. I want to calculate the number of employees present in the office by hour, by date.

Example Data

import pandas as pd
data1 = {'emp_id': ['Employee 1', 'Employee 2', 'Employee 3', 'Employee 4', 'Employee 5'],
    'Clockin': ['12/5/2021 0:08','8/7/2021 0:04','3/30/2021 1:24','12/23/2021 22:45', '12/23/2021 23:29'],
    'Clockout': ['12/5/2021 3:28','8/7/2021 0:34','3/30/2021 4:37','12/24/2021 0:42', '12/24/2021 1:42']}
df1 = pd.DataFrame(data1)

Example of expected output

import pandas as pd
data2 = {'Date': ['12/5/2021', '8/7/2021', '3/30/2021','3/30/2021','3/30/2021','3/30/2021', '12/23/2021','12/23/2021','12/24/2021','12/24/2021'],
    'Hour': ['01:00','01:00','02:00','03:00','04:00','05:00', '22:00','23:00', '01:00','02:00'],
    'emp_count': [1,1,1,1,1,1,1,2, 2,1]}
df2 = pd.DataFrame(data2)

CodePudding user response:

Try this:

# Round clock in DOWN to the nearest PRECEDING hour
clock_in = pd.to_datetime(df1["Clockin"]).dt.floor("H")
# Round clock out UP to the nearest SUCCEEDING hour
clock_out = pd.to_datetime(df1["Clockout"]).dt.ceil("H")

# Generate time series at hourly frequency between adjusted clock in and clock
# out time
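# NOTE: inclusive= (used below) requires pandas >= 1.4; on older
# versions use closed="right" instead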
hours = pd.Series(
    [
        pd.date_range(in_, out_, freq="H", inclusive="right")
        for in_, out_ in zip(clock_in, clock_out)
    ]
).explode()

# Final result
hours.groupby(hours).count()

Result:

2021-03-30 02:00:00    1
2021-03-30 03:00:00    1
2021-03-30 04:00:00    1
2021-03-30 05:00:00    1
2021-08-07 01:00:00    1
2021-12-05 01:00:00    1
2021-12-05 02:00:00    1
2021-12-05 03:00:00    1
2021-12-05 04:00:00    1
2021-12-23 23:00:00    1
2021-12-24 00:00:00    2
2021-12-24 01:00:00    2
2021-12-24 02:00:00    1
dtype: int64

It's slightly different from your expected output but consistent with your business rules.
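If you also want the Date / Hour / emp_count column layout from your df2, one possible reshape is sketched below (the strftime formats are an assumption; %m/%d/%Y zero-pads the date, unlike your example):

emp_count = hours.groupby(hours).count()
result = emp_count.rename_axis("ts").rename("emp_count").reset_index()
# explode() leaves object dtype, so convert back to datetime first
result["ts"] = pd.to_datetime(result["ts"])
result["Date"] = result["ts"].dt.strftime("%m/%d/%Y")
result["Hour"] = result["ts"].dt.strftime("%H:%M")
result = result[["Date", "Hour", "emp_count"]]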

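Since the question also asks about PySpark, here is a rough equivalent of the same approach, sketched under a couple of assumptions: it reuses df1 from the question, relies on the sequence function (Spark >= 2.4), and assumes no clock-out falls exactly on the hour, so that truncating to the hour and adding one hour matches the ceil above.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df1)  # reuse the pandas frame from the question

result = (
    sdf
    .withColumn("in_ts", F.to_timestamp("Clockin", "M/d/yyyy H:mm"))
    .withColumn("out_ts", F.to_timestamp("Clockout", "M/d/yyyy H:mm"))
    # Hourly grid, right-inclusive like the pandas version above
    .withColumn(
        "hour",
        F.explode(
            F.expr(
                "sequence(date_trunc('hour', in_ts) + interval 1 hour, "
                "date_trunc('hour', out_ts) + interval 1 hour, "
                "interval 1 hour)"
            )
        ),
    )
    .groupBy("hour")
    .count()
    .withColumnRenamed("count", "emp_count")
    .orderBy("hour")
)
result.show()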