Home > Blockchain >  Grouping ranges of data
Grouping ranges of data

Time:05-14

I have segmented linear data to identify the minimum and maximum begin and end values of each range, now I want to consolidate overlapping ranges. This is easy if the data is easy. For example, if the data is always increasing for both the begin and end values. (see the first 4 rows in my sample input below) But where the data is not as uniform, I'm having trouble computing the begin and end of a group of overlapping rows by relating data sets.

It's possible I'm headed down a rabbit hole because of an early decision about how to structure my inputs. The source data doesn't contain the rangebegin and rangeend columns I'm using for inputs. This was my choice to try to help me analyze the data. There may be other options.

Since the volume of information below may be an XY problem, here are the basics of what I'm trying to solve.

In my case, I'm trying to identify one-mile segments of roadway where there have been at least 5 crashes, then combine overlapping ranges to draw as a single geometric feature on a map. So...

Given this data:
Crash A at mile 2.22
Crash B at mile 2.24
Crash C at mile 3.10
Crash D at mile 3.92
Crash E at mile 5.03
Crash F at mile 5.21

I would have a segment from 2.22 to 3.92 because there is less than a mile between each crash in that range. Then there's a gap of more than a mile, then the last two crashes are within a mile of each other. So the output would look like

begin end
2.22 3.92
5.03 5.21

Because of some other variables, the way I am combining the data thus far has produced some ranges that are not sequential. Some of the ranges are included in other ranges. That seems to be causing my biggest problems.

I'm using SQL Server (mostly 2016), but the concept should hold for any database system.
Since I'm trying to produce a report, I'd rather not put anything (like a stored procedure) on the database server. So I want to use relational data sets and avoid structured programming involving looping through the data a row at a time.

I have worked up a simple example involving the methods I have previously used to solve similar problems, but this problem has slightly more complicated data that is giving me a headache.

Sample inputs:

drop table if exists #a

create table #a (
    category varchar(5),
    rangebegin int,
    rangeend int
)

insert #a
values 
  ('a', 1, 19)
, ('a', 1, 27)
, ('a', 9, 37)
, ('a', 14, 42)
, ('a', 45, 65)
, ('a', 47, 70)
, ('a', 52, 62)
, ('a', 65, 65)
, ('b', 3, 33)
, ('b', 17, 22)
, ('b', 21, 44)

select *
from #a
category rangebegin rangeend
a 1 19
a 1 27
a 9 37
a 14 42
a 45 65
a 47 70
a 52 62
a 65 67
b 3 33
b 17 22
b 21 44

I'm looking for output like:

category rangebegin rangeend
a 1 42
a 45 70
b 3 44

One attempt was to use LEAD and LAG, identify the first and last row of the segment, the consolidate to just the data I want. This works on the simple case (the first and last groups), but not where a range is contained in the larger range.

Sorry for the verbosity, but I thought it may be helpful to include intermediate outputs within the process.

Here are the steps I have developed:

select category
, rangebegin
, rangeend
, case when lag(rangeend) over (partition by category order by rangebegin) > rangebegin then null else 'begin' end as segmentbegin
, case when lead(rangebegin) over (partition by category order by rangebegin) < rangeend then null else 'end' end as segmentend
into #b
from #a
order by 1, 2, 3

select *
from #b
category rangebegin rangeend segmentbegin segmentend
a 1 19 begin NULL
a 1 27 NULL NULL
a 9 37 NULL NULL
a 14 42 NULL end
a 45 65 begin NULL
a 47 70 NULL NULL
a 52 62 NULL end
a 65 67 begin end -- but this should be in the previous segment because it's all between 45 and 70
b 3 33 begin NULL
b 17 22 NULL NULL
b 21 44 NULL end
select category
, rangebegin
, oldrangeend
, segment
into #c
from (
    select category
    , rangebegin
    , rangeend as oldrangeend
    , segmentbegin as segment
    from #b
    where segmentbegin = 'begin'
    union
    select category
    , rangebegin
    , rangeend
    , segmentend
    from #b
    where segmentend = 'end'
) q

select *
from #c
category rangebegin oldrangeend segment
a 1 19 begin
a 14 42 end
a 45 65 begin
a 52 62 end
a 65 67 begin
a 65 67 end
b 3 33 begin
b 21 44 end
select category
, rangebegin
, oldrangeend
, case when segment = 'begin' then lead(oldrangeend) over (order by category, rangebegin) end as rangeend
into #d
from #c

select category
, rangebegin
, rangeend
from #d
where rangeend is not null
category rangebegin rangeend
a 1 42
a 45 62 -- rangeend should be 70
a 65 67 -- This shouldn't be here because it is included in the previous row.
b 3 44

Notice that the query doesn't identify that the end of the second range is 70. That's because it's inspecting the adjacent row and determining that 65 > 62, so it starts a new group.

CodePudding user response:

I'm not sure how to put everything in one query but using adhoc/Stored procedure can solve this problem.

I have tested in my Sybase ASE, please try same approach with your SQL/Oracle.

Hope this help.

1. Add column SEQ (auto increment id) into source table

SEQ RANGE
1 2.22
2 2.24
3 3.10
4 3.92
5 5.03
6 5.21

2. Looping and calculate difference between 2 records

 CREATE TABLE TEST_RESULT
(
    BEGIN_MILE NUMERIC(10,2),
    END_MILE NUMERIC(10,2)
)

--Begin ad-hoc code
DECLARE @iLoop numeric(10), @recNo numeric(10), @sMsg VARCHAR(100), @tRANGE numeric(10,2), @rangebegin numeric(10,2), @rangeend numeric(10,2)

SET @recNo = (SELECT COUNT(*) FROM LLE_TEST_SEQ)
SET @iLoop = 3

--Set 1st begin
SET @rangebegin = (SELECT RANGE FROM LLE_TEST_SEQ WHERE SEQ = 1)
--Set 1st end
SET @rangeend = (SELECT RANGE FROM LLE_TEST_SEQ WHERE SEQ = 2)

WHILE (@iLoop <= @recNo)
    BEGIN
        SET @tRANGE = (SELECT RANGE FROM LLE_TEST_SEQ WHERE SEQ = @iLoop)
        
        --Display for testing
        PRINT '*****'
        SET @sMsg = CONVERT(VARCHAR, @rangeend)
        PRINT @sMsg     
        SET @sMsg = CONVERT(VARCHAR, @tRANGE)
        PRINT @sMsg
        PRINT '*****'
        
        --If delta < 1, increase end point
        IF @tRANGE - @rangeend < 1
            BEGIN
                SET @rangeend = @tRANGE
            END
        --Otherwise, break, insert and create new set
        ELSE
            BEGIN
                INSERT INTO TEST_RESULT (BEGIN_MILE, END_MILE) VALUES (@rangebegin, @rangeend)
                SET @rangebegin = @tRANGE
                SET @rangeend = @tRANGE
            END     
        
        
        SET @iLoop = @iLoop   1
        
    END
INSERT INTO TEST_RESULT (BEGIN_MILE, END_MILE) VALUES (@rangebegin, @rangeend)
--End ad-hoc code

Output

SELECT * FROM TEST_RESULT
BEGIN_MILE END_MILE
2.22 3.92
5.03 5.21

CodePudding user response:

Adding a column to #a and adding a few more data rows to address some scenarios not covered in original data set:

drop table if exists #a

create table #a
(pass        int
,category    varchar(5)
,rangebegin  int
,rangeend    int)

insert #a
values 
  (1, 'a', 1, 19)
, (1, 'a', 1, 27)
, (1, 'a', 9, 37)
, (1, 'a', 14, 42)
, (1, 'a', 45, 65)
, (1, 'a', 47, 70)
, (1, 'a', 52, 62)
, (1, 'a', 65, 65)

, (1, 'a', 81, 83)       -- standalone row/range
, (1, 'a', 95, 95)       -- standalone row/range; begin == end

, (1, 'b', 3, 33)
, (1, 'b', 17, 22)
, (1, 'b', 21, 44)

, (1, 'c', 30, 35)       -- no two rows
, (1, 'c', 33, 40)       -- define the
, (1, 'c', 39, 42)       -- entire range

One idea using a loop to iteratively consolidate ranges:

declare @pass           int,
        @prevcount      int,
        @currcount      int

select  @pass = 1

select  @prevcount = (select count(*) from #a where pass = @pass),
        @currcount = 0

while   @currcount != @prevcount
begin
        insert  #a
                (pass, category, rangebegin, rangeend)
        select  distinct
                @pass   1,
                a1.category,
                (select min(a2.rangebegin)
                 from   #a a2
                 where  a2.category     = a1.category
                 and    a2.pass         = @pass
                 and    (       a1.rangebegin   between  a2.rangebegin and a2.rangeend
                         or     a1.rangeend     between  a2.rangebegin and a2.rangeend)),
                (select max(a3.rangeend)
                 from   #a a3
                 where  a3.category     = a1.category
                 and    a3.pass         = @pass
                 and    (       a1.rangebegin   between  a3.rangebegin and a3.rangeend
                         or     a1.rangeend     between  a3.rangebegin and a3.rangeend))
        from    #a a1
        where   a1.pass = @pass

        select  @pass      = @pass   1,
                @prevcount = @currcount,
                @currcount = (select count(*) from #a where pass = @pass 1)

--      print '############# %1!', @pass
--      select  * from #a where pass = @pass order by 2,3,4
end

select  category,
        rangebegin,
        rangeend
from    #a
where   pass = @pass
order by 1,2,3

This generates:

 category rangebegin  rangeend
 -------- ----------- -----------
 a                  1          42
 a                 45          70
 a                 81          83
 a                 95          95
 b                  3          44
 c                 30          42

NOTES:

  • SQL Server 2019 fiddle
  • it may be possible to rewrite this using a recursive CTE ... maybe? yes? no? shrug (I'm a bit rusty with recursive CTEs)
  • for larger volumes of data OP may want to look at indexing #a
  • once a new set of data is inserted into #a the 'previous' set of data could be deleted as said data will no longer be needed ... OP's choice
  •  Tags:  
  • sql
  • Related