The problem: given a table where a column value marks a boundary, select all sequential rows between boundaries.
An example case:
Suppose you have several devices logging measurements into a table. Suppose also that these devices are known to malfunction, and send an erroneous signal every now and then. Once we receive an erroneous signal, we should mark those rows as suspicious, until we receive another erroneous signal (which means the device has become stable again).
Example fiddle: https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/5247
In our example, the erroneous signal (marked by value 99) separates the dataset as:
- Device A - Lines 1-8: good
- Device A - Lines 9-14: suspicious
- Device A - Lines 15-17: good
- Device A - Lines 18-23: suspicious
- Device B - Lines 24 : good
I'd like to mark rows 9-14 and 18-23 as suspicious. Can this be done using pure SQL, without resorting to cursors?
What I've tried:
with d as (
select *,
CASE
WHEN v=99 THEN 1
ELSE 0
END as mark
FROM test
),
d2 as (
SELECT *,
DENSE_RANK() OVER (PARTITION BY device,mark ORDER BY t) as r
FROM d
)
SELECT *
FROM d2
ORDER BY device,t;
Close, but no cigar. The r(anking) column isn't match of a help.
PS: I tried to make this as generic as possible in order to be useful for other users as well. I think it should be a pretty common problem.
CodePudding user response:
Don't use DENSE_RANK
, but a SUM
window function to count the occurrences of the marker value:
SELECT
SUM((v = 99)::int) OVER (PARTITION BY device ORDER BY t ASC),
*
FROM test
or even easier and more expressively use a filtered COUNT
:
SELECT
COUNT(*) FILTER (WHERE v=99) OVER (PARTITION BY device ORDER BY t ASC),
*
FROM test
If there have been an even amount of markers before, the row is unsuspicious, if there was an odd amount of marker the row is suspicious:
UPDATE test
SET suspect = (marker_count % 2) = 1
FROM (
SELECT
rowid,
COUNT(*) FILTER (WHERE v=99) OVER (PARTITION BY device ORDER BY t ASC) AS marker_count
FROM test
) m
WHERE m.rowid = test.rowid;
CodePudding user response:
Ok I got something that works. Feel free to post something better/more efficient.
with d as (
-- Create the mark column based on measurement
select *,
CASE
WHEN v=99 THEN 1
ELSE 0
END as mark
FROM test
),
d2 AS (
-- Create two helper columns...
SELECT d.*,
ROW_NUMBER() OVER (PARTITION BY device ORDER BY t) AS rn1,
ROW_NUMBER() OVER (PARTITION BY device,mark ORDER BY t) AS rn2
FROM d
),
d3 AS (
-- ...so as to have a "resetting" rank over the device,mark pair.
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY device, rn1-rn2 ORDER BY t) AS rn3
FROM d2
),
d4 AS (
-- Create the sum_mark column (cumulative sum over previous rows, grouped by device)
SELECT
rowid,device,t,v,mark, SUM(mark) OVER (PARTITION BY device ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum_mark
FROM d3
)
SELECT rowid,device,t,v,mark,
CASE
WHEN mark=1 THEN 1 -- if you don't want to include boundaries AS suspicious, set this to 0
WHEN MOD(sum_mark,2)=1 THEN 1 -- every odd group is suspicious
ELSE 0
END AS suspicious
FROM d4
ORDER BY device,t