I have a file that contains all of the time punch shifts for each employee like so:
employee | start_time | end_time |
---|---|---|
123 | 2022-10-23 10:40:00.000 | 2022-10-23 14:00:00.000 |
123 | 2022-10-23 14:00:00.000 | 2022-10-23 14:30:00.000 |
123 | 2022-10-23 14:35:00.000 | 2022-10-23 17:07:00.000 |
541 | 2022-10-23 06:50:00.000 | 2022-10-23 12:00:00.000 |
541 | 2022-10-23 13:00:00.000 | 2022-10-23 15:30:00.000 |
799 | 2022-10-23 18:00:00.000 | 2022-10-23 18:30:00.000 |
799 | 2022-10-23 18:35:00.000 | 2022-10-24 10:30:00.000 |
My ultimate ask is to be able to know how many employees were clocked in during each hour of the day like so:
Date | Hour | Count |
---|---|---|
2022-10-23 | 6 | 1 |
2022-10-23 | 7 | 1 |
2022-10-23 | 8 | 1 |
2022-10-23 | 9 | 1 |
2022-10-23 | 10 | 2 |
2022-10-23 | 11 | 2 |
2022-10-23 | 12 | 2 |
2022-10-23 | 13 | 2 |
2022-10-23 | 14 | 2 |
2022-10-23 | 15 | 2 |
2022-10-23 | 16 | 1 |
2022-10-23 | 17 | 1 |
2022-10-24 | 1 | 1 |
2022-10-24 | 2 | 1 |
2022-10-24 | 3 | 1 |
2022-10-24 | 4 | 1 |
2022-10-24 | 5 | 1 |
2022-10-24 | 6 | 1 |
2022-10-24 | 7 | 1 |
2022-10-24 | 8 | 1 |
2022-10-24 | 9 | 1 |
2022-10-24 | 10 | 1 |
The way I had thought about tackling this problem was to generate a series of 24 hours and simply count based on clock-ins and clock-outs (demo available at the end of the post). An issue I initially ran into was that my query was double-counting employees who clocked in and out within the same hour. This issue was solved in another question I posted here.
The fix was to collapse each employee's punches into continuous shifts (punches less than 60 minutes apart are merged) before baking them into my query, which returns the following:
employee | start_time | end_time |
---|---|---|
123 | 2022-10-23 10:40:00.000 | 2022-10-23 17:07:00.000 |
541 | 2022-10-23 06:50:00.000 | 2022-10-23 12:00:00.000 |
541 | 2022-10-23 13:00:00.000 | 2022-10-23 15:30:00.000 |
799 | 2022-10-23 18:00:00.000 | 2022-10-24 10:30:00.000 |
However, after baking this solution into my existing query, I am getting counts for 10/23/2022 but nothing for 10/24/2022. That is, my query is not accounting for when an employee clocks in on one day and clocks out on the next.
Here's my attempted query:
CREATE TABLE tab(
employee INT,
start_time DATETIME,
end_time DATETIME
);
INSERT INTO tab VALUES
(123, '2022-10-23 10:40:00.000', '2022-10-23 14:00:00.000'),
(123, '2022-10-23 14:00:00.000', '2022-10-23 14:30:00.000'),
(123, '2022-10-23 14:35:00.000', '2022-10-23 17:07:00.000'),
(541, '2022-10-23 06:50:00.000', '2022-10-23 12:00:00.000'),
(541, '2022-10-23 13:00:00.000', '2022-10-23 15:30:00.000'),
(799, '2022-10-23 18:00:00.000', '2022-10-23 22:30:00.000'),
(799, '2022-10-23 22:35:00.000', '2022-10-24 10:30:00.000');
SELECT * FROM tab;
WITH cte AS (
SELECT *,
CASE WHEN DATEDIFF(mi, LAG(end_time) OVER(PARTITION BY employee ORDER BY start_time), start_time) >= 60
THEN 1 ELSE 0
END AS change_partition
FROM tab
), cte2 AS (
SELECT *, SUM(change_partition) OVER(PARTITION BY employee ORDER BY start_time) AS partitions
FROM cte
),
Hours AS(
SELECT 1 AS HOUR
UNION ALL
SELECT HOUR + 1 FROM Hours WHERE HOUR < 24
)
SELECT
CAST(Start_Time AS DATE) [DATE],
HOUR,
COUNT(Employee) AS [Head Count]
FROM
Hours
LEFT JOIN(
SELECT
Employee, MIN(start_time) AS start_time, MAX(end_time) AS end_time
FROM
cte2
GROUP BY
Employee, partitions
) T on HOUR BETWEEN DATEPART(HOUR, START_TIME) AND DATEPART(HOUR, END_TIME)
GROUP BY
CAST(Start_Time AS DATE), HOUR
ORDER BY
CAST(Start_Time AS DATE), HOUR asc
My current query returns only this:
Date | Hour | Count |
---|---|---|
2022-10-23 | 6 | 1 |
2022-10-23 | 7 | 1 |
2022-10-23 | 8 | 1 |
2022-10-23 | 9 | 1 |
2022-10-23 | 10 | 2 |
2022-10-23 | 11 | 2 |
2022-10-23 | 12 | 2 |
2022-10-23 | 13 | 2 |
2022-10-23 | 14 | 2 |
2022-10-23 | 15 | 2 |
2022-10-23 | 16 | 1 |
2022-10-23 | 17 | 1 |
It should also return this:
Date | Hour | Count |
---|---|---|
2022-10-24 | 1 | 1 |
2022-10-24 | 2 | 1 |
2022-10-24 | 3 | 1 |
2022-10-24 | 4 | 1 |
2022-10-24 | 5 | 1 |
2022-10-24 | 6 | 1 |
2022-10-24 | 7 | 1 |
2022-10-24 | 8 | 1 |
2022-10-24 | 9 | 1 |
2022-10-24 | 10 | 1 |
Perhaps a completely different approach would be better. I've included a demo here.
CodePudding user response:
One straightforward way of doing this is:
- generate your calendar first, using the GENERATE_SERIES function, starting at a first datetime of your choice (according to the data you have) and stepping forward 1 hour at a time, 48 times
- left join the calendar with your table, matching rows where the calendar datetime falls between your employee's start and end datetime
- aggregate your employees with COUNT
WITH calendar AS (
SELECT DATEADD(HOUR, value, '2022-10-23 00:00:00') AS cal_date
FROM GENERATE_SERIES(0, 48, 1)
)
SELECT cal_date,
COUNT(DISTINCT employee) AS headcount
FROM calendar
LEFT JOIN tab
ON calendar.cal_date BETWEEN tab.start_time AND tab.end_time
GROUP BY cal_date
Check the demo here.
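Note that the BETWEEN join above counts an employee only if they are clocked in at the exact top of the hour, so a shift starting at 10:40 is first counted at 11:00. If partially worked hours should count as well (as in the expected output in the question, where hour 10 shows a count of 2), an interval overlap test could be used instead. The following is a minimal sketch, assuming the same tab table and SQL Server 2022 or later (compatibility level 160) for GENERATE_SERIES:

```sql
WITH calendar AS (
    -- one row per hour slot for 2022-10-23 and 2022-10-24 (48 slots: 0..47)
    SELECT DATEADD(HOUR, value, CAST('2022-10-23 00:00:00' AS DATETIME)) AS cal_date
    FROM GENERATE_SERIES(0, 47, 1)
)
SELECT cal_date,
       COUNT(DISTINCT employee) AS headcount
FROM calendar
LEFT JOIN tab
       -- the shift overlaps the half-open slot [cal_date, cal_date + 1 hour)
       ON tab.start_time < DATEADD(HOUR, 1, calendar.cal_date)
      AND tab.end_time   > calendar.cal_date
GROUP BY cal_date
ORDER BY cal_date;
```

With the strict comparisons, a shift ending exactly at 12:00 is not counted in the 12:00 slot; changing the second condition to >= would count it there, which is a design choice rather than something the question settles.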
Side Note: Avoid splitting the hour from the date. Prefer a single datetime field over a date field plus a separate integer hour field, because SQL Server's built-in date and time functions work directly on a single datetime value.
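If a report still needs the separate Date and Hour columns shown in the question, they can be derived from cal_date in the final SELECT rather than stored separately. This is a sketch under the assumption that only hours with at least one clocked-in employee should be listed (hence the inner join instead of the left join):

```sql
WITH calendar AS (
    -- hourly calendar, same idea as in the main query
    SELECT DATEADD(HOUR, value, CAST('2022-10-23 00:00:00' AS DATETIME)) AS cal_date
    FROM GENERATE_SERIES(0, 47, 1)
)
SELECT CAST(cal_date AS DATE)   AS [Date],   -- date part, for display only
       DATEPART(HOUR, cal_date) AS [Hour],   -- hour of day, 0 to 23
       COUNT(DISTINCT employee) AS [Count]
FROM calendar
INNER JOIN tab
        ON calendar.cal_date BETWEEN tab.start_time AND tab.end_time
GROUP BY cal_date
ORDER BY cal_date;
```

Grouping and ordering still happen on the single datetime value, so shifts that cross midnight need no special handling.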
If your SQL Server version doesn't have GENERATE_SERIES at your disposal, you can use a recursive query to generate your calendar:
WITH calendar AS (
SELECT CAST('2022-10-23 00:00:00' AS DATETIME) AS cal_date
UNION ALL
SELECT DATEADD(hh, 1, cal_date) AS cal_date
FROM calendar
WHERE cal_date < '2022-10-25 00:00:00'
)
...
Recursion begins from the value you set in the first part of the query (namely CAST('2022-10-23 00:00:00' AS DATETIME)) and adds 1 hour on each call, until it stops at the filtering condition (WHERE cal_date < '2022-10-25 00:00:00').
Check the demo with recursion here.
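For completeness, the recursive calendar can be combined with the same join and aggregation used earlier. One thing to keep in mind: SQL Server stops a recursive CTE after 100 recursion levels by default, so a calendar covering more than about four days of hourly rows needs an OPTION (MAXRECURSION n) hint. The value 1000 below is an arbitrary illustration, not a recommendation:

```sql
WITH calendar AS (
    -- anchor member: first hour of the calendar
    SELECT CAST('2022-10-23 00:00:00' AS DATETIME) AS cal_date
    UNION ALL
    -- recursive member: add one hour until the upper bound is reached
    SELECT DATEADD(HOUR, 1, cal_date)
    FROM calendar
    WHERE cal_date < '2022-10-25 00:00:00'
)
SELECT cal_date,
       COUNT(DISTINCT employee) AS headcount
FROM calendar
LEFT JOIN tab
       ON calendar.cal_date BETWEEN tab.start_time AND tab.end_time
GROUP BY cal_date
ORDER BY cal_date
OPTION (MAXRECURSION 1000);  -- not required for 48 hours; shown for longer ranges
```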