Intended use case
The user defines a start and end date to search for a given error. Errors may be of type "coming" (= occurred at that point in time, indicated by the value `1`) or "going" (= resolved at that point in time, indicated by the value `0`).
To make the data more useful, the result should also show when an error started or was resolved outside the chosen time range.
There is no minimum or maximum duration for how long an error may last. It may be anything between seconds and weeks.
Current solution
The data and code shown here are simplified to keep the example from getting overly complicated.
Data
| Date | Value |
| --- | --- |
| 2022-08-01 | 1 |
| 2022-08-03 | 0 |
| 2022-08-04 | 1 |
| 2022-08-06 | 0 |
| 2022-08-07 | 1 |
| 2022-08-09 | 0 |
Code
The current solution consists of three `SELECT`s combined by two `UNION` statements. First find the last occurrence (regardless of whether its value is `1` or `0`) before the chosen time range, then find the data within the range, and lastly find the first occurrence after the time range.
If the user wants to see errors between 3rd August and 9th August, the code looks like this:
```sql
(
  SELECT DATE, value FROM errors
  WHERE DATE < '2022-08-03'
  ORDER BY DATE DESC
  LIMIT 0,1
)
UNION
(
  SELECT DATE, value FROM errors
  WHERE DATE >= '2022-08-03'
  AND DATE <= '2022-08-09'
  ORDER BY DATE ASC
)
UNION
(
  SELECT DATE, value FROM errors
  WHERE DATE > '2022-08-09'
  ORDER BY DATE ASC
  LIMIT 0,1
)
```
The application code then checks whether the values from the first and third queries are actually required (i.e. if the first value inside the chosen time range is `1`, meaning the error started there, we don't need the `0` value from before the given range).
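As a runnable sketch of this approach (using SQLite in place of the real database, and the `errors`/`date`/`value` names from the simplified example; the helper name `bounded_errors` is made up for illustration), the three queries plus the application-side check could look like:

```python
import sqlite3

# In-memory stand-in for the simplified "errors" table from the question
# (the real table also has an exception-code column, omitted here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE errors (date TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO errors VALUES (?, ?)",
    [("2022-08-01", 1), ("2022-08-03", 0), ("2022-08-04", 1),
     ("2022-08-06", 0), ("2022-08-07", 1), ("2022-08-09", 0)],
)

def bounded_errors(start, end):
    """Rows in [start, end] plus the bounding rows outside it, if needed."""
    before = conn.execute(
        "SELECT date, value FROM errors WHERE date < ? "
        "ORDER BY date DESC LIMIT 1", (start,)).fetchall()
    inside = conn.execute(
        "SELECT date, value FROM errors WHERE date >= ? AND date <= ? "
        "ORDER BY date", (start, end)).fetchall()
    after = conn.execute(
        "SELECT date, value FROM errors WHERE date > ? "
        "ORDER BY date LIMIT 1", (end,)).fetchall()
    # Application-side check from the question: if the range already starts
    # with a "coming" (1), the earlier bounding row is not needed; likewise,
    # if it ends with a "going" (0), the later bounding row is not needed.
    if inside and inside[0][1] == 1:
        before = []
    if inside and inside[-1][1] == 0:
        after = []
    return before + inside + after

rows = bounded_errors("2022-08-03", "2022-08-09")
# rows[0] is ("2022-08-01", 1): the error that started before the range.
```

With the sample data, the chosen range starts with a `0` ("going"), so the bounding row from 1st August is kept; it ends with a `0`, so no trailing bounding row is added.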
The actual table contains data for hundreds of different errors (not shown in this example), several million records in total. There is already an index on DATE (and the exception code, likewise not shown here) to optimize execution time.
Optimization possibilities
I wonder how the query might be rewritten for optimal execution time without changing the underlying data; adding indexes is OK, though. Are there any specific SQL commands available to find these "bounding values"? (I copied this term from a Production Historian that offers similar functionality as part of its API.)
CodePudding user response:
The query you have is best optimized by having an index starting with `(date, value)`. (I can't predict how `exception_code` would fit in.)
And make it `UNION ALL`. (Newer versions have an optimization for `ALL` that does not work for `DISTINCT`.)
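To see why `ALL` matters: plain `UNION` implies de-duplication, which forces the engine to collect and compare all rows, while `UNION ALL` simply concatenates the result sets. Since the three parts of the question's query cover disjoint date ranges, they can never produce duplicate rows, so `ALL` is safe. A minimal illustration (SQLite here, but the semantics are the same in MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# UNION implies DISTINCT: duplicate rows across the parts are removed.
union = conn.execute("SELECT 1 AS v UNION SELECT 1").fetchall()

# UNION ALL just appends the result sets, skipping the de-duplication work.
union_all = conn.execute("SELECT 1 AS v UNION ALL SELECT 1").fetchall()

# union     -> [(1,)]
# union_all -> [(1,), (1,)]
```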
"Hundreds of error codes" -- Consider a SET
, perhaps implemented as several INTs
, with the errors grouped in some logical way.
Are the rows indicating transitions? I see no cases of "0" or "1" showing up twice in a row. This may (or may not) be adding complexity.
I presume your main task is to research the past, not to show the "current" state of each flag?
Based on

```sql
WHERE exception_code = ...
  AND date BETWEEN ... AND ...
ORDER BY date
```

I recommend

```sql
PRIMARY KEY(exception_code, date)
```
You said "hundreds" of codes; `exception_code SMALLINT UNSIGNED` will give you up to 64K codes in only 2 bytes.
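A sketch of the recommended schema (column types as suggested above are MySQL; SQLite is used below just to exercise the DDL, and it accepts but does not enforce `SMALLINT UNSIGNED`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE errors (
    exception_code SMALLINT UNSIGNED,   -- up to 64K codes in 2 bytes (in MySQL)
    date TEXT,                          -- DATETIME/TIMESTAMP in MySQL
    value INTEGER,                      -- 1 = error coming, 0 = error going
    PRIMARY KEY (exception_code, date)  -- matches the WHERE/ORDER BY pattern
)
""")
conn.execute("INSERT INTO errors VALUES (7, '2022-08-01', 1)")

# The composite PK covers this lookup pattern: equality on the first column,
# a range on the second, and the ORDER BY comes for free from the key order.
rows = conn.execute(
    "SELECT value FROM errors WHERE exception_code = 7 "
    "AND date BETWEEN '2022-08-01' AND '2022-08-09' ORDER BY date").fetchall()
```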
I assume the `date` might be a `DATETIME` or even `DATETIME(6)`. But you might want to avoid daylight-saving-time hiccups by switching to `TIMESTAMP` or `TIMESTAMP(6)`. (The "6" gives you microseconds, but that takes a little extra space with probably no extra benefit; adjust accordingly.)