Home > Software engineering >  How to select ordered subsets between boundary values using PostgreSQL
How to select ordered subsets between boundary values using PostgreSQL

Time:07-19

The problem: given a table where a column value marks a boundary, select all sequential rows between boundaries.

An example case:

Suppose you have several devices logging measurements into a table. Suppose also that these devices are known to malfunction, and send an erroneous signal every now and then. Once we receive an erroneous signal, we should mark those rows as suspicious, until we receive another erroneous signal (which means the device has become stable again).

Example fiddle: https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/5247

In our example, the erroneous signal (marked by value 99) separates the dataset as:

  • Device A - Lines 1-8: good
  • Device A - Lines 9-14: suspicious
  • Device A - Lines 15-17: good
  • Device A - Lines 18-23: suspicious
  • Device B - Lines 24 : good

I'd like to mark rows 9-14 and 18-23 as suspicious. Can this be done using pure SQL, without resorting to cursors?

What I've tried:

with d as (
select *,
  CASE 
    WHEN v=99 THEN 1
    ELSE 0
  END as mark
FROM test
), 
d2 as (
SELECT *,
  DENSE_RANK() OVER (PARTITION BY device,mark ORDER BY t) as r
FROM d
)
SELECT *
FROM d2
ORDER BY device,t;

Close, but no cigar. The r(anking) column isn't match of a help.

PS: I tried to make this as generic as possible in order to be useful for other users as well. I think it should be a pretty common problem.

CodePudding user response:

Don't use DENSE_RANK, but a SUM window function to count the occurrences of the marker value:

SELECT
  SUM((v = 99)::int) OVER (PARTITION BY device ORDER BY t ASC),
  *
FROM test

(online demo)

or even easier and more expressively use a filtered COUNT:

SELECT
  COUNT(*) FILTER (WHERE v=99) OVER (PARTITION BY device ORDER BY t ASC),
  *
FROM test

If there have been an even amount of markers before, the row is unsuspicious, if there was an odd amount of marker the row is suspicious:

UPDATE test
SET suspect = (marker_count % 2) = 1
FROM (
  SELECT
    rowid,
    COUNT(*) FILTER (WHERE v=99) OVER (PARTITION BY device ORDER BY t ASC) AS marker_count
  FROM test
) m
WHERE m.rowid = test.rowid;

(online demo)

CodePudding user response:

Ok I got something that works. Feel free to post something better/more efficient.

with d as (
-- Create the mark column based on measurement
select *,
  CASE 
    WHEN v=99 THEN 1
    ELSE 0
  END as mark
FROM test
),
d2 AS (
-- Create two helper columns...
    SELECT d.*,
        ROW_NUMBER() OVER (PARTITION BY device ORDER BY t) AS rn1,
        ROW_NUMBER() OVER (PARTITION BY device,mark ORDER BY t) AS rn2
    FROM d
),
d3 AS (
-- ...so as to have a "resetting" rank over the device,mark pair. 
SELECT 
    *,
    ROW_NUMBER() OVER (PARTITION BY device, rn1-rn2 ORDER BY t) AS rn3
    FROM d2
),
d4 AS (
-- Create the sum_mark column (cumulative sum over previous rows, grouped by device)
SELECT 
    rowid,device,t,v,mark, SUM(mark) OVER (PARTITION  BY device ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_mark
FROM d3
)
SELECT rowid,device,t,v,mark,
    CASE
        WHEN mark=1 THEN 1  -- if you don't want to include boundaries AS suspicious, set this to 0
        WHEN MOD(sum_mark,2)=1 THEN 1 -- every odd group is suspicious
        ELSE 0
    END AS suspicious
FROM d4
ORDER BY device,t
  • Related