Home > Software design >  Select duplicates (including near-duplicate numbers)
Select duplicates (including near-duplicate numbers)

Time:10-19

I have a ROAD_PROJECTS table in Oracle 18c:

with road_projects (proj_id, road_id, year_, status, from_measure, to_measure) as (

select 100, 1, 2022, 'APPROVED',  null, 100.1 from dual union all 
select 101, 1, 2022, 'APPROVED',     0, 100.1 from dual union all

select 102, 1, 2022, 'APPROVED',     0, 200.6 from dual union all
select 103, 1, 2022, 'APPROVED',     0, 199.3 from dual union all
select 104, 1, 2022, 'APPROVED',     0,   201 from dual union all

select 105, 2, 2023, 'PROPOSED',     0,    50 from dual union all
select 106, 2, 2023, 'PROPOSED',    75,   100 from dual union all

select 107, 3, 2024, 'DEFERRED',     0,   100 from dual union all
select 108, 3, 2025, 'DEFERRED',     0,   110 from dual union all

select 109, 4, 2026, 'PROPOSED',     0,  null from dual union all
select 110, 4, 2026, 'DEFERRED',     0,  null from dual)

select * from road_projects
   PROJ_ID    ROAD_ID      YEAR_ STATUS   FROM_MEASURE TO_MEASURE
---------- ---------- ---------- -------- ------------ ----------
       100          1       2022 APPROVED         null      100.1 --duplicates (other than PROJ_ID); null is to be treated as zero
       101          1       2022 APPROVED            0      100.1

       102          1       2022 APPROVED            0      200.6 --duplicates: TO_MEASURES are approximately the same (within a 5 metre tolerance)
       103          1       2022 APPROVED            0      199.3
       104          1       2022 APPROVED            0        201

       105          2       2023 PROPOSED            0         50 --not duplicates: FROM_MEASURES are different and TO_MEASURES are different
       106          2       2023 PROPOSED           75        100

       107          3       2024 DEFERRED            0        100 --not duplicates: YEARS are different and TO_MEASURES are different
       108          3       2025 DEFERRED            0        110

       109          4       2026 PROPOSED            0       null  --not duplicates: STATUSES are different
       110          4       2026 DEFERRED            0       null

DB<>fiddle


I want to select rows where the ROAD_ID, YEAR_, STATUS, FROM_MEASURES, and TO_MEASURES are duplicated.

  • In the case of FROM_MEASURES and TO_MEASURES, I want to use a 5 metre tolerance. For example, these TO_MEASURES would be considered duplicates: 200.6, 199.3, and 201.

    • Note: To be honest, I haven't wrapped my head around how the tolerance would work. Would it be 5 /- from the "middle" of the range? I'm open to ideas -- whatever's easiest.
  • When comparing FROM_MEASURES or TO_MEASURES, I want to treat null as 0. But when it comes to the output, returning either null or 0 would be fine.

The result would look like this:

   PROJ_ID    ROAD_ID      YEAR_ STATUS   FROM_MEASURE TO_MEASURE
---------- ---------- ---------- -------- ------------ ----------
       100          1       2022 APPROVED         null      100.1 --duplicates
       101          1       2022 APPROVED            0      100.1

       102          1       2022 APPROVED            0      200.6 --duplicates
       103          1       2022 APPROVED            0      199.3
       104          1       2022 APPROVED            0        201

How can I select duplicates -- including near-duplicate numbers?

CodePudding user response:

You could use an exists clause to look for a close duplicate:

select * from road_projects rp
where exists (
  select null
  from road_projects rp2
  where rp2.proj_id != rp.proj_id
  and rp2.road_id = rp.road_id
  and rp2.year_ = rp.year_
  and rp2.status = rp.status
  and abs(coalesce(rp2.from_measure, 0) - coalesce(rp.from_measure, 0)) < 5
  and abs(coalesce(rp2.to_measure, 0) - coalesce(rp.to_measure, 0)) < 5
)
order by proj_id
PROJ_ID ROAD_ID YEAR_ STATUS FROM_MEASURE TO_MEASURE
100 1 2022 APPROVED null 100.1
101 1 2022 APPROVED 0 100.1
102 1 2022 APPROVED 0 200.6
103 1 2022 APPROVED 0 199.3
104 1 2022 APPROVED 0 201

fiddle

The tolerance is applied by subtracting the two from (or to) measures, getting the absolute (unsigned) value for that difference, and seeing if that is less than 5.


Partly for fun, partly because of @Jorge's comment... you could also do this with match_recognize pattern matching, which would allow you to keep track of which duplicates go together, and the min/max from/to across a group of duplicates, if that sort of thing might be useful:

select proj_id, road_id, year_, status, from_measure, to_measure,
  from_tolerance, to_tolerance,
  dupe_group, first_proj_id, first_from_measure, last_to_measure
from road_projects
match_recognize (
  partition by road_id, year_, status
  order by from_measure nulls first, to_measure
  measures
    match_number() as dupe_group,
    first(proj_id) as first_proj_id,
    first(from_measure) as first_from_measure,
    final last(to_measure) as last_to_measure,
    abs(coalesce(dupe.from_measure, 0) - coalesce(prev(dupe.from_measure), 0)) as from_tolerance,
    abs(coalesce(dupe.to_measure, 0) - coalesce(prev(dupe.to_measure), 0)) as to_tolerance
  all rows per match
  after match skip past last row
  pattern (orig dupe )
  define
    dupe as abs(coalesce(dupe.from_measure, 0) - coalesce(prev(dupe.from_measure), 0)) < 5
      and abs(coalesce(dupe.to_measure, 0) - coalesce(prev(dupe.to_measure), 0)) < 5
)
order by proj_id
PROJ_ID ROAD_ID YEAR_ STATUS FROM_MEASURE TO_MEASURE FROM_TOLERANCE TO_TOLERANCE DUPE_GROUP FIRST_PROJ_ID FIRST_FROM_MEASURE LAST_TO_MEASURE
100 1 2022 APPROVED null 100.1 0 0 1 100 null 100.1
101 1 2022 APPROVED 0 100.1 0 0 1 100 null 100.1
102 1 2022 APPROVED 0 200.6 0 1.3 2 103 0 201
103 1 2022 APPROVED 0 199.3 0 0 2 103 0 201
104 1 2022 APPROVED 0 201 0 .4 2 103 0 201

fiddle

CodePudding user response:

So, coalesce is going to be needed to get null to = 0. Then, you can just join the table on itself and look for what you want. My example:

select distinct
r.proj_id
, r.road_id
, r.year_
, r.from_measure
, r.to_measure
from road_projects r
left join road_projects r2 on r2.proj_id != r.proj_id and r2.road_id = r.road_id and r2.year_ = r.year_ and coalesce(r2.from_measure, 0) = coalesce(r.from_measure, 0) and (r2.to_measure < r.to_measure   5 and r2.to_measure > r.to_measure -5)

where r2.proj_id is not null

this should give you what is needed.

  • Related