Intended use case
The user defines a start and end date to search for a given error. Errors may be of type "coming" (= occurred at that point in time, indicated by the value `1`) or "going" (= resolved at that point in time, indicated by the value `0`).
To make the data more useful, the result should also show when an error started or was resolved outside the chosen time range.
There is no minimum or maximum duration for how long an error may last. It may be anything between seconds and weeks.
Current solution
The data and code shown here are simplified to keep the example from getting overly complicated.
Data
| Date | Value |
| --- | --- |
| 2022-08-01 | 1 |
| 2022-08-03 | 0 |
| 2022-08-04 | 1 |
| 2022-08-06 | 0 |
| 2022-08-07 | 1 |
| 2022-08-09 | 0 |
Code
The current solution consists of three `SELECT`s combined by two `UNION` statements. First find the last occurrence (regardless of whether its value is `1` or `0`) before the chosen time range, then find the data within the range, and lastly find the first occurrence after the time range.
If the user wants to see errors between 3rd August and 9th August, the code looks like this:
```sql
(
  SELECT DATE, value FROM errors
  WHERE DATE < '2022-08-03'
  ORDER BY DATE DESC
  LIMIT 0,1
)
UNION
(
  SELECT DATE, value FROM errors
  WHERE DATE >= '2022-08-03'
  AND DATE <= '2022-08-09'
  ORDER BY DATE ASC
)
UNION
(
  SELECT DATE, value FROM errors
  WHERE DATE > '2022-08-09'
  ORDER BY DATE ASC
  LIMIT 0,1
)
```
The application code then checks whether the values from the first and third queries are actually required (i.e. if the first value inside the chosen time range is `1`, meaning the error started there, we don't need the `0` value from before the given range).
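As a runnable sketch of this approach (using SQLite in place of the real database, and the `errors`/`date`/`value` names from the simplified example; the helper name `bounded_errors` is made up for illustration), the three queries plus the application-side check could look like:

```python
import sqlite3

# In-memory stand-in for the simplified "errors" table from the question
# (the real table also has an exception-code column, omitted here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE errors (date TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO errors VALUES (?, ?)",
    [("2022-08-01", 1), ("2022-08-03", 0), ("2022-08-04", 1),
     ("2022-08-06", 0), ("2022-08-07", 1), ("2022-08-09", 0)],
)

def bounded_errors(start, end):
    """Rows in [start, end] plus the bounding rows outside it, if needed."""
    before = conn.execute(
        "SELECT date, value FROM errors WHERE date < ? "
        "ORDER BY date DESC LIMIT 1", (start,)).fetchall()
    inside = conn.execute(
        "SELECT date, value FROM errors WHERE date >= ? AND date <= ? "
        "ORDER BY date", (start, end)).fetchall()
    after = conn.execute(
        "SELECT date, value FROM errors WHERE date > ? "
        "ORDER BY date LIMIT 1", (end,)).fetchall()
    # Application-side check from the question: if the range already starts
    # with a "coming" (1), the earlier bounding row is not needed; likewise,
    # if it ends with a "going" (0), the later bounding row is not needed.
    if inside and inside[0][1] == 1:
        before = []
    if inside and inside[-1][1] == 0:
        after = []
    return before + inside + after

rows = bounded_errors("2022-08-03", "2022-08-09")
# rows[0] is ("2022-08-01", 1): the error that started before the range.
```

With the sample data, the chosen range starts with a `0` ("going"), so the bounding row from 1st August is kept; it ends with a `0`, so no trailing bounding row is added.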
The actual table contains data for hundreds of different errors (not shown in this example), several million records in total. There is already an index on DATE (and the exception code, likewise not shown here) to optimize execution time.
Optimization possibilities
I wonder how the query might be rewritten for optimal execution time without changing the underlying data; adding indexes is OK, though. Are there any specific SQL commands available to find these "bounding values"? (I copied this term from a Production Historian that offers similar functionality as part of its API.)
CodePudding user response:
The query you have is best optimized by having an index starting with `(date, value)`. (I can't predict how `exception_code` would fit in.)
And make it `UNION ALL`. (Newer versions have an optimization for `ALL` that does not work for `DISTINCT`.)
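To see why `ALL` matters: plain `UNION` implies de-duplication, which forces the engine to collect and compare all rows, while `UNION ALL` simply concatenates the result sets. Since the three parts of the question's query cover disjoint date ranges, they can never produce duplicate rows, so `ALL` is safe. A minimal illustration (SQLite here, but the semantics are the same in MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# UNION implies DISTINCT: duplicate rows across the parts are removed.
union = conn.execute("SELECT 1 AS v UNION SELECT 1").fetchall()

# UNION ALL just appends the result sets, skipping the de-duplication work.
union_all = conn.execute("SELECT 1 AS v UNION ALL SELECT 1").fetchall()

# union     -> [(1,)]
# union_all -> [(1,), (1,)]
```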
"Hundreds of error codes" -- Consider a SET
, perhaps implemented as several INTs
, with the errors grouped in some logical way.
Are the rows indicating transitions? I see no cases of "0" or "1" showing up twice in a row. This may (or may not) be adding complexity.
I presume your main task is to research the past, not to show the "current" state of each flag?
Based on

```sql
WHERE exception_code = ...
  AND date BETWEEN ... AND ...
ORDER BY date
```

I recommend

```sql
PRIMARY KEY(exception_code, date)
```
You said "hundreds" of codes; `exception_code SMALLINT UNSIGNED` will give you up to 64K codes in only 2 bytes.
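A sketch of the recommended schema (column types as suggested above are MySQL; SQLite is used below just to exercise the DDL, and it accepts but does not enforce `SMALLINT UNSIGNED`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE errors (
    exception_code SMALLINT UNSIGNED,   -- up to 64K codes in 2 bytes (in MySQL)
    date TEXT,                          -- DATETIME/TIMESTAMP in MySQL
    value INTEGER,                      -- 1 = error coming, 0 = error going
    PRIMARY KEY (exception_code, date)  -- matches the WHERE/ORDER BY pattern
)
""")
conn.execute("INSERT INTO errors VALUES (7, '2022-08-01', 1)")

# The composite PK covers this lookup pattern: equality on the first column,
# a range on the second, and the ORDER BY comes for free from the key order.
rows = conn.execute(
    "SELECT value FROM errors WHERE exception_code = 7 "
    "AND date BETWEEN '2022-08-01' AND '2022-08-09' ORDER BY date").fetchall()
```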
I assume the `date` might be a `DATETIME` or even `DATETIME(6)`. But you might want to avoid daylight-saving-time hiccups by switching to `TIMESTAMP` or `TIMESTAMP(6)`. (The "6" gives you microseconds, but that takes a little extra space with probably no extra benefit; adjust accordingly.)