I have segmented linear data to identify the minimum and maximum begin and end values of each range, now I want to consolidate overlapping ranges. This is easy if the data is easy. For example, if the data is always increasing for both the begin and end values. (see the first 4 rows in my sample input below) But where the data is not as uniform, I'm having trouble computing the begin and end of a group of overlapping rows by relating data sets.
It's possible I'm headed down a rabbit hole because of an early decision about how to structure my inputs. The source data doesn't contain the rangebegin and rangeend columns I'm using for inputs. This was my choice to try to help me analyze the data. There may be other options.
Since the volume of information below may be an XY problem, here are the basics of what I'm trying to solve.
In my case, I'm trying to identify one-mile segments of roadway where there have been at least 5 crashes, then combine overlapping ranges to draw as a single geometric feature on a map. So...
Given this data:
Crash A at mile 2.22
Crash B at mile 2.24
Crash C at mile 3.10
Crash D at mile 3.92
Crash E at mile 5.03
Crash F at mile 5.21
I would have a segment from 2.22 to 3.92 because there is less than a mile between each crash in that range. Then there's a gap of more than a mile, then the last two crashes are within a mile of each other. So the output would look like
begin | end |
---|---|
2.22 | 3.92 |
5.03 | 5.21 |
Because of some other variables, the way I am combining the data thus far has produced some ranges that are not sequential. Some of the ranges are included in other ranges. That seems to be causing my biggest problems.
I'm using SQL Server (mostly 2016), but the concept should hold for any database system.
Since I'm trying to produce a report, I'd rather not put anything (like a stored procedure) on the database server. So I want to use relational data sets and avoid structured programming involving looping through the data a row at a time.
I have worked up a simple example involving the methods I have previously used to solve similar problems, but this problem has slightly more complicated data that is giving me a headache.
Sample inputs:
drop table if exists #a
create table #a (
category varchar(5),
rangebegin int,
rangeend int
)
insert #a
values
('a', 1, 19)
, ('a', 1, 27)
, ('a', 9, 37)
, ('a', 14, 42)
, ('a', 45, 65)
, ('a', 47, 70)
, ('a', 52, 62)
, ('a', 65, 65)
, ('b', 3, 33)
, ('b', 17, 22)
, ('b', 21, 44)
select *
from #a
category | rangebegin | rangeend |
---|---|---|
a | 1 | 19 |
a | 1 | 27 |
a | 9 | 37 |
a | 14 | 42 |
a | 45 | 65 |
a | 47 | 70 |
a | 52 | 62 |
a | 65 | 67 |
b | 3 | 33 |
b | 17 | 22 |
b | 21 | 44 |
I'm looking for output like:
category | rangebegin | rangeend |
---|---|---|
a | 1 | 42 |
a | 45 | 70 |
b | 3 | 44 |
One attempt was to use LEAD
and LAG
, identify the first and last row of the segment, the consolidate to just the data I want. This works on the simple case (the first and last groups), but not where a range is contained in the larger range.
Sorry for the verbosity, but I thought it may be helpful to include intermediate outputs within the process.
Here are the steps I have developed:
select category
, rangebegin
, rangeend
, case when lag(rangeend) over (partition by category order by rangebegin) > rangebegin then null else 'begin' end as segmentbegin
, case when lead(rangebegin) over (partition by category order by rangebegin) < rangeend then null else 'end' end as segmentend
into #b
from #a
order by 1, 2, 3
select *
from #b
category | rangebegin | rangeend | segmentbegin | segmentend | |
---|---|---|---|---|---|
a | 1 | 19 | begin | NULL | |
a | 1 | 27 | NULL | NULL | |
a | 9 | 37 | NULL | NULL | |
a | 14 | 42 | NULL | end | |
a | 45 | 65 | begin | NULL | |
a | 47 | 70 | NULL | NULL | |
a | 52 | 62 | NULL | end | |
a | 65 | 67 | begin | end | -- but this should be in the previous segment because it's all between 45 and 70 |
b | 3 | 33 | begin | NULL | |
b | 17 | 22 | NULL | NULL | |
b | 21 | 44 | NULL | end |
select category
, rangebegin
, oldrangeend
, segment
into #c
from (
select category
, rangebegin
, rangeend as oldrangeend
, segmentbegin as segment
from #b
where segmentbegin = 'begin'
union
select category
, rangebegin
, rangeend
, segmentend
from #b
where segmentend = 'end'
) q
select *
from #c
category | rangebegin | oldrangeend | segment |
---|---|---|---|
a | 1 | 19 | begin |
a | 14 | 42 | end |
a | 45 | 65 | begin |
a | 52 | 62 | end |
a | 65 | 67 | begin |
a | 65 | 67 | end |
b | 3 | 33 | begin |
b | 21 | 44 | end |
select category
, rangebegin
, oldrangeend
, case when segment = 'begin' then lead(oldrangeend) over (order by category, rangebegin) end as rangeend
into #d
from #c
select category
, rangebegin
, rangeend
from #d
where rangeend is not null
category | rangebegin | rangeend | |
---|---|---|---|
a | 1 | 42 | |
a | 45 | 62 | -- rangeend should be 70 |
a | 65 | 67 | -- This shouldn't be here because it is included in the previous row. |
b | 3 | 44 |
Notice that the query doesn't identify that the end of the second range is 70. That's because it's inspecting the adjacent row and determining that 65 > 62, so it starts a new group.
CodePudding user response:
I'm not sure how to put everything in one query but using adhoc/Stored procedure can solve this problem.
I have tested in my Sybase ASE, please try same approach with your SQL/Oracle.
Hope this help.
1. Add column SEQ (auto increment id) into source table
SEQ | RANGE |
---|---|
1 | 2.22 |
2 | 2.24 |
3 | 3.10 |
4 | 3.92 |
5 | 5.03 |
6 | 5.21 |
2. Looping and calculate difference between 2 records
CREATE TABLE TEST_RESULT
(
BEGIN_MILE NUMERIC(10,2),
END_MILE NUMERIC(10,2)
)
--Begin ad-hoc code
DECLARE @iLoop numeric(10), @recNo numeric(10), @sMsg VARCHAR(100), @tRANGE numeric(10,2), @rangebegin numeric(10,2), @rangeend numeric(10,2)
SET @recNo = (SELECT COUNT(*) FROM LLE_TEST_SEQ)
SET @iLoop = 3
--Set 1st begin
SET @rangebegin = (SELECT RANGE FROM LLE_TEST_SEQ WHERE SEQ = 1)
--Set 1st end
SET @rangeend = (SELECT RANGE FROM LLE_TEST_SEQ WHERE SEQ = 2)
WHILE (@iLoop <= @recNo)
BEGIN
SET @tRANGE = (SELECT RANGE FROM LLE_TEST_SEQ WHERE SEQ = @iLoop)
--Display for testing
PRINT '*****'
SET @sMsg = CONVERT(VARCHAR, @rangeend)
PRINT @sMsg
SET @sMsg = CONVERT(VARCHAR, @tRANGE)
PRINT @sMsg
PRINT '*****'
--If delta < 1, increase end point
IF @tRANGE - @rangeend < 1
BEGIN
SET @rangeend = @tRANGE
END
--Otherwise, break, insert and create new set
ELSE
BEGIN
INSERT INTO TEST_RESULT (BEGIN_MILE, END_MILE) VALUES (@rangebegin, @rangeend)
SET @rangebegin = @tRANGE
SET @rangeend = @tRANGE
END
SET @iLoop = @iLoop 1
END
INSERT INTO TEST_RESULT (BEGIN_MILE, END_MILE) VALUES (@rangebegin, @rangeend)
--End ad-hoc code
Output
SELECT * FROM TEST_RESULT
BEGIN_MILE | END_MILE |
---|---|
2.22 | 3.92 |
5.03 | 5.21 |
CodePudding user response:
Adding a column to #a
and adding a few more data rows to address some scenarios not covered in original data set:
drop table if exists #a
create table #a
(pass int
,category varchar(5)
,rangebegin int
,rangeend int)
insert #a
values
(1, 'a', 1, 19)
, (1, 'a', 1, 27)
, (1, 'a', 9, 37)
, (1, 'a', 14, 42)
, (1, 'a', 45, 65)
, (1, 'a', 47, 70)
, (1, 'a', 52, 62)
, (1, 'a', 65, 65)
, (1, 'a', 81, 83) -- standalone row/range
, (1, 'a', 95, 95) -- standalone row/range; begin == end
, (1, 'b', 3, 33)
, (1, 'b', 17, 22)
, (1, 'b', 21, 44)
, (1, 'c', 30, 35) -- no two rows
, (1, 'c', 33, 40) -- define the
, (1, 'c', 39, 42) -- entire range
One idea using a loop to iteratively consolidate ranges:
declare @pass int,
@prevcount int,
@currcount int
select @pass = 1
select @prevcount = (select count(*) from #a where pass = @pass),
@currcount = 0
while @currcount != @prevcount
begin
insert #a
(pass, category, rangebegin, rangeend)
select distinct
@pass 1,
a1.category,
(select min(a2.rangebegin)
from #a a2
where a2.category = a1.category
and a2.pass = @pass
and ( a1.rangebegin between a2.rangebegin and a2.rangeend
or a1.rangeend between a2.rangebegin and a2.rangeend)),
(select max(a3.rangeend)
from #a a3
where a3.category = a1.category
and a3.pass = @pass
and ( a1.rangebegin between a3.rangebegin and a3.rangeend
or a1.rangeend between a3.rangebegin and a3.rangeend))
from #a a1
where a1.pass = @pass
select @pass = @pass 1,
@prevcount = @currcount,
@currcount = (select count(*) from #a where pass = @pass 1)
-- print '############# %1!', @pass
-- select * from #a where pass = @pass order by 2,3,4
end
select category,
rangebegin,
rangeend
from #a
where pass = @pass
order by 1,2,3
This generates:
category rangebegin rangeend
-------- ----------- -----------
a 1 42
a 45 70
a 81 83
a 95 95
b 3 44
c 30 42
NOTES:
- SQL Server 2019 fiddle
- it may be possible to rewrite this using a recursive CTE ... maybe? yes? no? shrug (I'm a bit rusty with recursive CTEs)
- for larger volumes of data OP may want to look at indexing
#a
- once a new set of data is inserted into
#a
the 'previous' set of data could be deleted as said data will no longer be needed ... OP's choice