I have a ROAD_PROJECTS table in Oracle 18c:
with road_projects (proj_id, road_id, year_, status, from_measure, to_measure) as (
select 100, 1, 2022, 'APPROVED', null, 100.1 from dual union all
select 101, 1, 2022, 'APPROVED', 0, 100.1 from dual union all
select 102, 1, 2022, 'APPROVED', 0, 200.6 from dual union all
select 103, 1, 2022, 'APPROVED', 0, 199.3 from dual union all
select 104, 1, 2022, 'APPROVED', 0, 201 from dual union all
select 105, 2, 2023, 'PROPOSED', 0, 50 from dual union all
select 106, 2, 2023, 'PROPOSED', 75, 100 from dual union all
select 107, 3, 2024, 'DEFERRED', 0, 100 from dual union all
select 108, 3, 2025, 'DEFERRED', 0, 110 from dual union all
select 109, 4, 2026, 'PROPOSED', 0, null from dual union all
select 110, 4, 2026, 'DEFERRED', 0, null from dual)
select * from road_projects
PROJ_ID ROAD_ID YEAR_ STATUS FROM_MEASURE TO_MEASURE
---------- ---------- ---------- -------- ------------ ----------
100 1 2022 APPROVED null 100.1 --duplicates (other than PROJ_ID); null is to be treated as zero
101 1 2022 APPROVED 0 100.1
102 1 2022 APPROVED 0 200.6 --duplicates: TO_MEASURES are approximately the same (within a 5 metre tolerance)
103 1 2022 APPROVED 0 199.3
104 1 2022 APPROVED 0 201
105 2 2023 PROPOSED 0 50 --not duplicates: FROM_MEASURES are different and TO_MEASURES are different
106 2 2023 PROPOSED 75 100
107 3 2024 DEFERRED 0 100 --not duplicates: YEARS are different and TO_MEASURES are different
108 3 2025 DEFERRED 0 110
109 4 2026 PROPOSED 0 null --not duplicates: STATUSES are different
110 4 2026 DEFERRED 0 null
I want to select rows where the ROAD_ID, YEAR_, STATUS, FROM_MEASURES, and TO_MEASURES are duplicated.
In the case of FROM_MEASURES and TO_MEASURES, I want to use a 5 metre tolerance. For example, these TO_MEASURES would be considered duplicates: 200.6, 199.3, and 201.
- Note: To be honest, I haven't wrapped my head around how the tolerance would work. Would it be 5 /- from the "middle" of the range? I'm open to ideas -- whatever's easiest.
When comparing FROM_MEASURES or TO_MEASURES, I want to treat null as 0. But when it comes to the output, returning either null or 0 would be fine.
The result would look like this:
PROJ_ID ROAD_ID YEAR_ STATUS FROM_MEASURE TO_MEASURE
---------- ---------- ---------- -------- ------------ ----------
100 1 2022 APPROVED null 100.1 --duplicates
101 1 2022 APPROVED 0 100.1
102 1 2022 APPROVED 0 200.6 --duplicates
103 1 2022 APPROVED 0 199.3
104 1 2022 APPROVED 0 201
How can I select duplicates -- including near-duplicate numbers?
CodePudding user response:
You could use an exists
clause to look for a close duplicate:
select * from road_projects rp
where exists (
select null
from road_projects rp2
where rp2.proj_id != rp.proj_id
and rp2.road_id = rp.road_id
and rp2.year_ = rp.year_
and rp2.status = rp.status
and abs(coalesce(rp2.from_measure, 0) - coalesce(rp.from_measure, 0)) < 5
and abs(coalesce(rp2.to_measure, 0) - coalesce(rp.to_measure, 0)) < 5
)
order by proj_id
PROJ_ID | ROAD_ID | YEAR_ | STATUS | FROM_MEASURE | TO_MEASURE |
---|---|---|---|---|---|
100 | 1 | 2022 | APPROVED | null | 100.1 |
101 | 1 | 2022 | APPROVED | 0 | 100.1 |
102 | 1 | 2022 | APPROVED | 0 | 200.6 |
103 | 1 | 2022 | APPROVED | 0 | 199.3 |
104 | 1 | 2022 | APPROVED | 0 | 201 |
The tolerance is applied by subtracting the two from (or to) measures, getting the absolute (unsigned) value for that difference, and seeing if that is less than 5.
Partly for fun, partly because of @Jorge's comment... you could also do this with match_recognize
pattern matching, which would allow you to keep track of which duplicates go together, and the min/max from/to across a group of duplicates, if that sort of thing might be useful:
select proj_id, road_id, year_, status, from_measure, to_measure,
from_tolerance, to_tolerance,
dupe_group, first_proj_id, first_from_measure, last_to_measure
from road_projects
match_recognize (
partition by road_id, year_, status
order by from_measure nulls first, to_measure
measures
match_number() as dupe_group,
first(proj_id) as first_proj_id,
first(from_measure) as first_from_measure,
final last(to_measure) as last_to_measure,
abs(coalesce(dupe.from_measure, 0) - coalesce(prev(dupe.from_measure), 0)) as from_tolerance,
abs(coalesce(dupe.to_measure, 0) - coalesce(prev(dupe.to_measure), 0)) as to_tolerance
all rows per match
after match skip past last row
pattern (orig dupe )
define
dupe as abs(coalesce(dupe.from_measure, 0) - coalesce(prev(dupe.from_measure), 0)) < 5
and abs(coalesce(dupe.to_measure, 0) - coalesce(prev(dupe.to_measure), 0)) < 5
)
order by proj_id
PROJ_ID | ROAD_ID | YEAR_ | STATUS | FROM_MEASURE | TO_MEASURE | FROM_TOLERANCE | TO_TOLERANCE | DUPE_GROUP | FIRST_PROJ_ID | FIRST_FROM_MEASURE | LAST_TO_MEASURE |
---|---|---|---|---|---|---|---|---|---|---|---|
100 | 1 | 2022 | APPROVED | null | 100.1 | 0 | 0 | 1 | 100 | null | 100.1 |
101 | 1 | 2022 | APPROVED | 0 | 100.1 | 0 | 0 | 1 | 100 | null | 100.1 |
102 | 1 | 2022 | APPROVED | 0 | 200.6 | 0 | 1.3 | 2 | 103 | 0 | 201 |
103 | 1 | 2022 | APPROVED | 0 | 199.3 | 0 | 0 | 2 | 103 | 0 | 201 |
104 | 1 | 2022 | APPROVED | 0 | 201 | 0 | .4 | 2 | 103 | 0 | 201 |
CodePudding user response:
So, coalesce is going to be needed to get null to = 0. Then, you can just join the table on itself and look for what you want. My example:
select distinct
r.proj_id
, r.road_id
, r.year_
, r.from_measure
, r.to_measure
from road_projects r
left join road_projects r2 on r2.proj_id != r.proj_id and r2.road_id = r.road_id and r2.year_ = r.year_ and coalesce(r2.from_measure, 0) = coalesce(r.from_measure, 0) and (r2.to_measure < r.to_measure 5 and r2.to_measure > r.to_measure -5)
where r2.proj_id is not null
this should give you what is needed.