Parsing a string in both PostgreSQL and Microsoft SQL Server

Time: 10-04

I have the following data (all of the table DDL and data DML are available on the fiddles here (SQL Server) and here (PostgreSQL)):

I already have solutions; this question is about efficiency and what the optimal way of doing this would be.

CREATE TABLE ticket
(
  ticket_id INTEGER NOT NULL,
  working_time VARCHAR(30) NULL DEFAULT NULL CHECK (working_time != '')
);

and data:

ticket_id   working_time
       18   20.02.2021,15:00,17:00
       18   20.02.2021,15:00,17:00
       18   20.02.2021,15:00,17:00
       20   20.02.2021,12:00,14:15
       20   _rubbish__              --  <---   deliberate
       20   20.02.2021,12:00,14:15  
       20   
       20   21.02.2021,12:00,14:15
       20   _rubbish__
       20   21.02.2021,12:00,14:15
       20   
11 rows

The _rubbish__ entry in the data is deliberate - it's free text and I have to be able to cope with poor data!

Now, I want a result like this:

Ticket ID     The date  hrs_worked_per_ticket
       18   2021-02-20               06:00:00
       20   2021-02-20               04:30:00
       20   2021-02-21               04:30:00

There's no need to tell me that the schema is appalling - I find the idea of storing a date (in non-ISO format) and the times like that in a single line to be abhorrent! There is no choice in this matter.

I have an answer of my own for both PostgreSQL and SQL Server (see below), but I would like to know if there's a more efficient way of doing this?

CodePudding user response:

I have a PostgreSQL solution here - try_cast_time and try_cast_date are functions that I wrote, inspired by this post (the whole thread is helpful!):

SELECT DISTINCT
  ticket_id,
  try_cast_date(working_time)::DATE,
  SUM((try_cast_date(working_time) + try_cast_time(working_time, 18, 5)) -
      (try_cast_date(working_time) + try_cast_time(working_time, 12, 5)))
    OVER (PARTITION BY ticket_id, try_cast_date(working_time)::DATE)
  AS ts_diff
FROM ticket
WHERE try_cast_date(working_time)::DATE IS NOT NULL
ORDER BY ticket_id, try_cast_date(working_time)::DATE;
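The bodies of try_cast_date and try_cast_time aren't reproduced in the post, but a minimal sketch of what they might look like (hypothetical — signatures inferred from the query above, where the second and third arguments are the substring start and length) is:

```sql
-- Hypothetical sketches only: the original try_cast_date / try_cast_time
-- bodies are not shown in the post. They return NULL instead of raising
-- an error on rubbish input, mirroring SQL Server's TRY_CONVERT.
CREATE OR REPLACE FUNCTION try_cast_date(p_text TEXT)
RETURNS DATE
LANGUAGE plpgsql IMMUTABLE AS
$$
BEGIN
  -- The leading 10 characters are expected to be DD.MM.YYYY
  RETURN to_date(substring(p_text FROM 1 FOR 10), 'DD.MM.YYYY');
EXCEPTION WHEN OTHERS THEN
  RETURN NULL;
END;
$$;

CREATE OR REPLACE FUNCTION try_cast_time(p_text TEXT, p_start INT, p_len INT)
RETURNS TIME
LANGUAGE plpgsql IMMUTABLE AS
$$
BEGIN
  RETURN substring(p_text FROM p_start FOR p_len)::TIME;
EXCEPTION WHEN OTHERS THEN
  RETURN NULL;
END;
$$;
```

With the sample data, try_cast_time(working_time, 12, 5) picks out the start time and try_cast_time(working_time, 18, 5) the end time.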

Result:

ticket_id   try_cast_date    ts_diff
       18      2021-02-20   06:00:00
       20      2021-02-20   04:30:00
       20      2021-02-21   04:30:00

CodePudding user response:

I have a SQL Server solution here (it's pretty horrible!):

WITH cte AS
(
  SELECT
    ticket_id,
    CAST
    (
      TRY_CONVERT
      (
        DATE,
        SUBSTRING(working_time, 7, 4) + '.' +
        SUBSTRING(working_time, 4, 2) + '.' +
        SUBSTRING(working_time, 1, 2)
      ) AS DATETIME
    )
    +
    CAST
    (
      CAST(SUBSTRING(working_time, 12, 5) AS TIME) AS DATETIME
    ) AS st_dt,
    CAST
    (
      TRY_CONVERT
      (
        DATE,
        SUBSTRING(working_time, 7, 4) + '.' +
        SUBSTRING(working_time, 4, 2) + '.' +
        SUBSTRING(working_time, 1, 2)
      ) AS DATETIME
    )
    +
    CAST
    (
      CAST(SUBSTRING(working_time, 18, 5) AS TIME) AS DATETIME
    ) AS et_dt
  FROM ticket
)
SELECT
  ticket_id AS "Ticket ID",
  TRY_CONVERT(date, et_dt) AS "The date",
  TRY_CONVERT
  (
    VARCHAR(8), 
    dateadd
    (
      second, 
      COALESCE(SUM
      (
        DATEDIFF(SECOND, st_dt, et_dt)
      ), 0), 
      0
    ),  
    108
  ) AS hrs_worked_per_ticket
FROM 
  cte
WHERE TRY_CONVERT(DATE, et_dt) IS NOT NULL
GROUP BY ticket_id, TRY_CONVERT(DATE, et_dt)
ORDER BY ticket_id, TRY_CONVERT(DATE, et_dt);
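As an aside, the duplicated SUBSTRING/TRY_CONVERT logic can be factored out with CROSS APPLY so each parsing expression is written once — a sketch under the same parsing assumptions, not tested against the fiddle:

```sql
-- Sketch: factor the repeated parsing into a single CROSS APPLY so the
-- date and the two times are each converted exactly once per row.
SELECT
  t.ticket_id AS "Ticket ID",
  p.work_date AS "The date",
  CONVERT(VARCHAR(8),
          DATEADD(SECOND,
                  SUM(DATEDIFF(SECOND, p.start_time, p.end_time)), 0),
          108) AS hrs_worked_per_ticket
FROM ticket AS t
CROSS APPLY
(
  SELECT
    TRY_CONVERT(DATE,
                SUBSTRING(t.working_time, 7, 4) + '.' +
                SUBSTRING(t.working_time, 4, 2) + '.' +
                SUBSTRING(t.working_time, 1, 2)) AS work_date,
    TRY_CONVERT(TIME, SUBSTRING(t.working_time, 12, 5)) AS start_time,
    TRY_CONVERT(TIME, SUBSTRING(t.working_time, 18, 5)) AS end_time
) AS p
WHERE p.work_date IS NOT NULL
GROUP BY t.ticket_id, p.work_date
ORDER BY t.ticket_id, p.work_date;
```

This keeps the same semantics as the CTE version (rubbish rows fall out via the `work_date IS NOT NULL` filter) while being easier to maintain.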

Result:

Ticket ID   The date    hrs_worked_per_ticket
       18   2021-02-20               06:00:00
       20   2021-02-20               04:30:00
       20   2021-02-21               04:30:00

CodePudding user response:

So you have a working version, which, as unwieldy as it is, is not necessarily badly performing just because of its verbosity.

You asked, however, if there's a more efficient way, and for SQL Server (I can't comment on Postgres) you can both greatly simplify the query and improve performance by adding persisted computed columns and a supporting index on the date.

This removes the non-sargability of the query, allows the optimizer to fully utilise indexes for filtering and aggregating, and avoids the overhead of parsing and casting the string values at query time, because that work is now done once, when the row is inserted or updated.

Add the computed columns:

alter table ticket add WorkingDate as
  Try_convert(date,
              Concat(Substring(working_time, 7, 4),
                     Substring(working_time, 4, 2),
                     Substring(working_time, 1, 2)), 112) persisted;

alter table ticket add WorkingDuration as
  DateDiff(minute,
           Try_convert(time, Substring(working_time, 12, 5), 114),
           Try_convert(time, Substring(working_time, 18, 5), 114)) persisted;

Add a supporting index:

create clustered index Ix_Id_WorkingDate on ticket(ticket_id, workingdate);

And then your query becomes:

with w as (
    select ticket_id, workingDate, Sum(workingDuration) d
    from ticket
    group by ticket_id, workingDate
)
select ticket_id, workingdate as [The date],
       FORMAT(d / 60 * 100 + d % 60, '#:0#') as hrs_worked_per_ticket
from w
where d > 0;
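The FORMAT expression turns a minute count into 'h:mm' with integer arithmetic: for 270 minutes, 270 / 60 * 100 + 270 % 60 = 400 + 30 = 430, which the '#:0#' picture should render with a colon inserted before the last two digits. An illustration (values chosen for the example, not from the fiddle):

```sql
-- Illustration only: 270 minutes -> 4 * 100 + 30 = 430 -> '4:30'
SELECT FORMAT(270 / 60 * 100 + 270 % 60, '#:0#') AS hrs_worked;
```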

See amended Fiddle

Comparing to your original query is not going to yield any noticeable improvements on so few rows, but it would perform significantly better on a large data set, particularly if you needed to filter further by dates or date ranges.

However, the estimated execution plan suggests 18% relative cost for this version vs 82% for your original version.
