I have the following data (all of the table DDL and data DML are available on the fiddles here (SQL Server) and here (PostgreSQL)):
I already have solutions; this question is about efficiency and what the optimal way of doing this would be.
CREATE TABLE ticket
(
ticket_id INTEGER NOT NULL,
working_time VARCHAR (30) NULL DEFAULT NULL CHECK (working_time != '')
);
and data:
ticket_id working_time
18 20.02.2021,15:00,17:00
18 20.02.2021,15:00,17:00
18 20.02.2021,15:00,17:00
20 20.02.2021,12:00,14:15
20 _rubbish__ -- <--- deliberate
20 20.02.2021,12:00,14:15
20
20 21.02.2021,12:00,14:15
20 _rubbish__
20 21.02.2021,12:00,14:15
20
11 rows
The _rubbish__ entry in the data is deliberate - it's free text and I have to be able to cope with poor data!
Now, I want a result like this:
Ticket ID The date hrs_worked_per_ticket
18 2021-02-20 06:00:00
20 2021-02-20 04:30:00
20 2021-02-21 04:30:00
There's no need to tell me that the schema is appalling - I find the idea of storing a date (in non-ISO format) and the times like that in a single line to be abhorrent! There is no choice in this matter.
I have an answer of my own for both PostgreSQL and SQL Server (see below), but I would like to know if there's a more efficient way of doing this?
CodePudding user response:
I have a PostgreSQL solution here - try_cast_time and try_cast_date are functions that I wrote, inspired by this post (the whole thread is helpful!):
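The two helper functions themselves aren't shown in the post. A minimal sketch of what they might look like, assuming (as the call sites below suggest) that try_cast_time takes the string plus a start position and length, and that both swallow conversion errors and return NULL:

```sql
-- Hypothetical sketches only; the author's actual functions aren't shown.
-- try_cast_date: parse the leading DD.MM.YYYY field, returning NULL on bad input.
CREATE OR REPLACE FUNCTION try_cast_date(p_text TEXT)
RETURNS DATE
LANGUAGE plpgsql
AS $$
BEGIN
    RETURN to_date(split_part(p_text, ',', 1), 'DD.MM.YYYY');
EXCEPTION WHEN OTHERS THEN
    RETURN NULL;
END;
$$;

-- try_cast_time: cast a fixed-position substring to TIME, returning NULL on bad input.
CREATE OR REPLACE FUNCTION try_cast_time(p_text TEXT, p_start INT, p_len INT)
RETURNS TIME
LANGUAGE plpgsql
AS $$
BEGIN
    RETURN substring(p_text FROM p_start FOR p_len)::TIME;
EXCEPTION WHEN OTHERS THEN
    RETURN NULL;
END;
$$;
```

The EXCEPTION block is what makes these "try" casts: a row like '_rubbish__' simply yields NULL instead of aborting the query.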
SELECT DISTINCT
    ticket_id,
    try_cast_date(working_time)::DATE,
    SUM((try_cast_date(working_time) + try_cast_time(working_time, 18, 5)) -
        (try_cast_date(working_time) + try_cast_time(working_time, 12, 5)))
        OVER (PARTITION BY ticket_id, try_cast_date(working_time)::DATE)
        AS ts_diff
FROM ticket
WHERE try_cast_date(working_time)::DATE IS NOT NULL
ORDER BY ticket_id, try_cast_date(working_time)::DATE;
Result:
ticket_id try_cast_date ts_diff
18 2021-02-20 06:00:00
20 2021-02-20 04:30:00
20 2021-02-21 04:30:00
CodePudding user response:
I have a SQL Server solution here (it's pretty horrible!):
WITH cte AS
(
  SELECT
    ticket_id,
    CAST(TRY_CONVERT(DATE,
           SUBSTRING(working_time, 7, 4) + '.' +
           SUBSTRING(working_time, 4, 2) + '.' +
           SUBSTRING(working_time, 1, 2)) AS DATETIME)
      + CAST(CAST(SUBSTRING(working_time, 12, 5) AS TIME) AS DATETIME) AS st_dt,
    CAST(TRY_CONVERT(DATE,
           SUBSTRING(working_time, 7, 4) + '.' +
           SUBSTRING(working_time, 4, 2) + '.' +
           SUBSTRING(working_time, 1, 2)) AS DATETIME)
      + CAST(CAST(SUBSTRING(working_time, 18, 5) AS TIME) AS DATETIME) AS et_dt
  FROM ticket
)
SELECT
  ticket_id AS "Ticket ID",
  TRY_CONVERT(DATE, et_dt) AS "The date",
  TRY_CONVERT(VARCHAR(8),
    DATEADD(SECOND, COALESCE(SUM(DATEDIFF(SECOND, st_dt, et_dt)), 0), 0),
    108) AS hrs_worked_per_ticket
FROM cte
WHERE TRY_CONVERT(DATE, et_dt) IS NOT NULL
GROUP BY ticket_id, TRY_CONVERT(DATE, et_dt)
ORDER BY ticket_id, TRY_CONVERT(DATE, et_dt);
Result:
Ticket ID The date hrs_worked_per_ticket
18 2021-02-20 06:00:00
20 2021-02-20 04:30:00
20 2021-02-21 04:30:00
CodePudding user response:
So you have a working version which, as unwieldy as it is, is not necessarily badly performing just because of its verbosity.
You asked, however, if there's a more efficient way, and for SQL Server (I can't comment on Postgres) you can both greatly simplify the query and improve performance by adding persisted computed columns and a supporting index on the date.
This removes the non-sargability of the query, allows the optimizer to fully utilise indexes for filtering and aggregating, and moves the overhead of parsing and casting the string values out of the query, since that work is now done when the row is inserted or updated.
Add the computed columns:
alter table ticket add WorkingDate as
    Try_convert(date,
        Concat(Substring(working_time, 7, 4),
               Substring(working_time, 4, 2),
               Substring(working_time, 1, 2)), 112) persisted;

alter table ticket add WorkingDuration as
    DateDiff(minute,
        Try_convert(time, Substring(working_time, 12, 5), 114),
        Try_convert(time, Substring(working_time, 18, 5), 114)) persisted;
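A quick sanity check of the new columns against the sample data (expected values worked out by hand from the rows above):

```sql
select ticket_id, working_time, WorkingDate, WorkingDuration
from ticket
where ticket_id = 20;
-- '20.02.2021,12:00,14:15' should yield WorkingDate = 2021-02-20
--   and WorkingDuration = 135 (minutes between 12:00 and 14:15);
-- '_rubbish__' and the empty rows should yield a NULL WorkingDate,
--   so they drop out of any query that filters on the date.
```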
Add a supporting index:
create clustered index Ix_Id_WorkingDate on ticket (ticket_id, WorkingDate);
And then your query becomes:
with w as
(
    select ticket_id, WorkingDate, Sum(WorkingDuration) as d
    from ticket
    group by ticket_id, WorkingDate
)
select ticket_id, WorkingDate as [The date],
       Format(d / 60 * 100 + d % 60, '#:0#') as hrs_worked_per_ticket
from w
where d > 0;
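The Format expression packs the hours and minutes into a single integer before formatting. For example, for the 270-minute total on ticket 18:

```sql
-- 270 / 60 = 4 (integer division), and 4 * 100 + 270 % 60 = 430;
-- the custom numeric pattern '#:0#' then renders 430 as '4:30'.
select Format(270 / 60 * 100 + 270 % 60, '#:0#') as hrs_worked;
```

Note this produces h:mm rather than the hh:mm:ss strings in the original output; adjust the pattern if you need exact parity.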
Comparing to your original query is not going to yield any noticeable improvements on so few rows, but it would perform significantly better on a large data set, particularly if you needed to further filter by dates or ranges.
However, the estimated execution plan suggests 18% of the batch cost for this version vs 82% for your original version.