Count employees that were clocked in for each hour of the day

I have a file that contains all of the time punch shifts for each employee like so:

employee	start_time	end_time
123	2022-10-23 10:40:00.000	2022-10-23 14:00:00.000
123	2022-10-23 14:00:00.000	2022-10-23 14:30:00.000
123	2022-10-23 14:35:00.000	2022-10-23 17:07:00.000
541	2022-10-23 06:50:00.000	2022-10-23 12:00:00.000
541	2022-10-23 13:00:00.000	2022-10-23 15:30:00.000
799	2022-10-23 18:00:00.000	2022-10-23 18:30:00.000
799	2022-10-23 18:35:00.000	2022-10-24 10:30:00.000

My ultimate ask is to be able to know how many employees were clocked in during each hour of the day like so:

Date	Hour	Count
2022-10-23	6	1
2022-10-23	7	1
2022-10-23	8	1
2022-10-23	9	1
2022-10-23	10	2
2022-10-23	11	2
2022-10-23	12	2
2022-10-23	13	2
2022-10-23	14	2
2022-10-23	15	2
2022-10-23	16	1
2022-10-23	17	1
2022-10-24	1	1
2022-10-24	2	1
2022-10-24	3	1
2022-10-24	4	1
2022-10-24	5	1
2022-10-24	6	1
2022-10-24	7	1
2022-10-24	8	1
2022-10-24	9	1
2022-10-24	10	1

My approach was to generate a series of 24 hours and count employees based on their clock-ins and clock-outs (demo available at the end of the post). An issue I initially ran into was that my query counted duplicates for employees who clocked in and out within the same hour. That issue was solved in another question I posted.

The fix merges each employee's shifts that are less than an hour apart into a single span before the counting step, which returns the following:

employee	start_time	end_time
123	2022-10-23 10:40:00.000	2022-10-23 17:07:00.000
541	2022-10-23 06:50:00.000	2022-10-23 12:00:00.000
541	2022-10-23 13:00:00.000	2022-10-23 15:30:00.000
799	2022-10-23 18:00:00.000	2022-10-24 10:30:00.000

However, after baking this solution into my existing query, I am getting counts for 10/23/2022 but nothing for 10/24/2022. That is, my query is not accounting for when an employee clocks in on one day and clocks out on the next.

Here's my attempted query:

CREATE TABLE tab(
    employee        INT,
    start_time      DATETIME,    
    end_time        DATETIME
);
INSERT INTO tab VALUES
(123,       '2022-10-23 10:40:00.000', '2022-10-23 14:00:00.000'),
(123,       '2022-10-23 14:00:00.000', '2022-10-23 14:30:00.000'),
(123,       '2022-10-23 14:35:00.000', '2022-10-23 17:07:00.000'),
(541,       '2022-10-23 06:50:00.000', '2022-10-23 12:00:00.000'),
(541,       '2022-10-23 13:00:00.000', '2022-10-23 15:30:00.000'),
(799,       '2022-10-23 18:00:00.000', '2022-10-23 22:30:00.000'),
(799,       '2022-10-23 22:35:00.000', '2022-10-24 10:30:00.000');

SELECT * FROM tab;

WITH cte AS (
    SELECT *,
           CASE WHEN DATEDIFF(mi, LAG(end_time) OVER(PARTITION BY employee ORDER BY start_time), start_time) >= 60 
                THEN 1 ELSE 0 
           END AS change_partition
    FROM tab
), cte2 AS (
    SELECT *, SUM(change_partition) OVER(PARTITION BY employee ORDER BY start_time) AS partitions
    FROM cte
),

  Hours AS(
    
    SELECT 1 AS HOUR
    UNION ALL
    SELECT HOUR + 1 FROM Hours WHERE HOUR < 24

)
SELECT 
    CAST(Start_Time AS DATE) [DATE], 
    HOUR, 
    COUNT(Employee) AS [Head Count]
FROM 
    Hours

LEFT JOIN(

    SELECT 
        Employee, MIN(start_time) AS start_time, MAX(end_time) AS end_time 
    FROM 
        cte2
    GROUP BY 
        Employee, partitions

) T on HOUR BETWEEN DATEPART(HOUR, START_TIME) AND DATEPART(HOUR, END_TIME)

GROUP BY 
    CAST(Start_Time AS DATE), HOUR
ORDER BY 
    CAST(Start_Time AS DATE), HOUR asc

My current query returns only this:

Date	Hour	Count
2022-10-23	6	1
2022-10-23	7	1
2022-10-23	8	1
2022-10-23	9	1
2022-10-23	10	2
2022-10-23	11	2
2022-10-23	12	2
2022-10-23	13	2
2022-10-23	14	2
2022-10-23	15	2
2022-10-23	16	1
2022-10-23	17	1

It should also return this:

Date	Hour	Count
2022-10-24	1	1
2022-10-24	2	1
2022-10-24	3	1
2022-10-24	4	1
2022-10-24	5	1
2022-10-24	6	1
2022-10-24	7	1
2022-10-24	8	1
2022-10-24	9	1
2022-10-24	10	1

Perhaps a completely different approach would be better, but I've included a demo here.

User response:

One straightforward way of doing this is:

Generate your calendar first with the GENERATE_SERIES function, starting at a first date of your choice (based on the data you have) and stepping forward one hour at a time for 48 hours.
Left join the calendar with your table, matching rows where the calendar datetime falls between the employee's start and end datetimes.
Aggregate with COUNT to get the number of employees for each calendar hour.

WITH calendar AS (
    SELECT DATEADD(HOUR, value, '2022-10-23 00:00:00') AS cal_date
    FROM GENERATE_SERIES(0, 48, 1) 
)
SELECT cal_date, 
       COUNT(DISTINCT employee) AS headcount
FROM      calendar
LEFT JOIN tab
       ON calendar.cal_date BETWEEN tab.start_time AND tab.end_time
GROUP BY cal_date

Check the demo here.
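
Note that the join above only counts an employee when they are clocked in at the exact top of the hour. If someone clocked in for just part of an hour should count toward that hour too (the 06:50 start counted under hour 6 in the question suggests so), one possible variant is an interval-overlap join. This is a minimal sketch, assuming the tab table from the question and SQL Server 2022 or later for GENERATE_SERIES; the aliases c and t and the output column names are my own choices:

```sql
-- Sketch only: assumes the tab table from the question and GENERATE_SERIES (SQL Server 2022+).
WITH calendar AS (
    -- 48 hourly buckets (values 0..47) starting at an assumed first date.
    SELECT DATEADD(HOUR, value, CAST('2022-10-23 00:00:00' AS DATETIME)) AS cal_date
    FROM GENERATE_SERIES(0, 47, 1)
)
SELECT CAST(c.cal_date AS DATE)   AS [Date],
       DATEPART(HOUR, c.cal_date) AS [Hour],
       COUNT(DISTINCT t.employee) AS [Count]
FROM      calendar AS c
LEFT JOIN tab AS t
       -- Overlap test: the shift starts before the bucket ends
       -- and ends after the bucket starts.
       ON  t.start_time < DATEADD(HOUR, 1, c.cal_date)
       AND t.end_time   > c.cal_date
GROUP BY c.cal_date
ORDER BY c.cal_date;
```

Because COUNT(DISTINCT ...) counts each employee once per bucket, back-to-back punches within the same hour are not double counted. Hours with nobody clocked in come back with a count of 0; switching to an INNER JOIN would drop them if you don't want those rows.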

Side note: it's not recommended to split the hour from the date. Prefer a single datetime column over a date column plus an integer hour column, because SQL Server's date and time functions and comparisons work directly on datetime values.
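
To illustrate the side note with the merged span for employee 799 from the question (18:00 on the 23rd to 10:30 on the 24th): once only the hour parts are compared, the range becomes BETWEEN 18 AND 10, which no hour can satisfy. That is likely why the original query loses shifts that cross midnight. A small sketch; the variable names and the probe hour 20 are chosen only for illustration:

```sql
-- Values taken from the merged span for employee 799 in the question.
DECLARE @start_time DATETIME = '2022-10-23 18:00:00';
DECLARE @end_time   DATETIME = '2022-10-24 10:30:00';

SELECT DATEPART(HOUR, @start_time) AS start_hour,   -- 18
       DATEPART(HOUR, @end_time)   AS end_hour,     -- 10
       -- Hour parts only: BETWEEN 18 AND 10 is an empty range, so hour 20 is missed.
       CASE WHEN 20 BETWEEN DATEPART(HOUR, @start_time) AND DATEPART(HOUR, @end_time)
            THEN 'counted' ELSE 'missed' END AS hour_only_check,
       -- Full datetimes keep the date, so 20:00 on the 23rd falls inside the shift.
       CASE WHEN CAST('2022-10-23 20:00:00' AS DATETIME) BETWEEN @start_time AND @end_time
            THEN 'counted' ELSE 'missed' END AS datetime_check;
```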

If your SQL Server version doesn't have GENERATE_SERIES at your disposal, you can use a recursive query to generate your calendar:

WITH calendar AS (
    SELECT CAST('2022-10-23 00:00:00' AS DATETIME) AS cal_date
    UNION ALL
    SELECT DATEADD(hh, 1, cal_date) AS cal_date
    FROM calendar
    WHERE cal_date < '2022-10-25 00:00:00'
)
...

Recursion begins from the value you set in the first (anchor) part of the query, namely CAST('2022-10-23 00:00:00' AS DATETIME), and adds 1 hour on each step until the filtering condition (WHERE cal_date < '2022-10-25 00:00:00') stops it.

Check the demo with recursion here.
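
For reference, here is roughly how the recursive calendar could be combined with the same join and aggregation as in the first query. This is a sketch, assuming the tab table from the question. The OPTION (MAXRECURSION ...) hint is my addition: SQL Server stops a recursive CTE after 100 levels by default, so it is only needed once the calendar spans more than about four days of hourly rows.

```sql
WITH calendar AS (
    -- Anchor member: the first calendar hour.
    SELECT CAST('2022-10-23 00:00:00' AS DATETIME) AS cal_date
    UNION ALL
    -- Recursive member: add one hour until the end of the range is reached.
    SELECT DATEADD(HOUR, 1, cal_date)
    FROM calendar
    WHERE cal_date < '2022-10-25 00:00:00'
)
SELECT cal_date,
       COUNT(DISTINCT employee) AS headcount
FROM      calendar
LEFT JOIN tab
       ON calendar.cal_date BETWEEN tab.start_time AND tab.end_time
GROUP BY cal_date
ORDER BY cal_date
-- Raises the default limit of 100 recursion levels; not required for this 2-day range.
OPTION (MAXRECURSION 1000);
```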