Home > OS >  SQL Server : Grouping sets of duplicate rows
SQL Server : Grouping sets of duplicate rows

Time:10-23

I have geographic products that may have duplicates.

I return a list of possible duplicates and display them on a map for a user to check and delete.

To help the user cross reference between rows, I want to colour code the duplicates. Two or more rows that appear to be duplicates should have the same ColourGroup, so that it is easy to compare the two green rows or the three red rows.

I want to return a unique number for each ColourGroup.

SQL

This returns a list of Products that may have duplicates based on their Latitude, Longitude, ProductType and Price /- 5%.

WITH Prods AS
(
SELECT p.ProductID,
p. ProductType,
p.Price,
p.Price ((p.Price/100)*5) As PriceUpper,
p.Price-((p.Price/100)*5) As PriceLower,
Round(p.Latitude,3) As Latitude,
Round(p.Longitude,3) As Longitude
FROM Products p
AND p.Latitude is not null AND p.Longitude is not null
)

SELECT 
DISTINCT a.ProductID,
b.ProductID As Duplicate,
a.Latitude,
a.Longitude
FROM Prods a

INNER JOIN Prods b ON a.ProductID <> b.ProductID
AND a.Latitude = b.Latitude 
AND a.Longitude = b.Longitude
AND a.ProductType = b.ProductType

AND (b.Price < a.PriceUpper AND b.Price > a.PriceLower)

DESIRED RESULT

Products

ProductID   Price    Latitude   Longitude   ProductType
ID1         500      12.34      56.78       Widget
ID2         505      12.34      56.78       Widget
ID3         200      12.34      56.78       Widget
ID4         800      12.34      56.78       Widget
ID5         500      12.34      56.78       Doodad

ID6         300      98.76      54.32       Doodad
ID7         295      98.76      54.32       Doodad
ID8         302      98.76      54.32       Doodad
ID9         100      98.76      54.32       Doodad

ID10        250      12.34      56.78       Thingamy
ID11        600      12.34      56.78       Thingamy

I want to return the following:

ProductID   Duplicate  Latitude   Longitude   ColourGroup
ID1         ID2        12.34      56.78       1
ID2         ID1        12.34      56.78       1

ID6         ID7        98.76      54.32       2
ID6         ID8        98.76      54.32       2
ID7         ID6        98.76      54.32       2
ID7         ID8        98.76      54.32       2
ID8         ID6        98.76      54.32       2
ID8         ID7        98.76      54.32       2

ID3 and ID4 do not match ID1 or ID2 because they are outside the /- 5%, and do not match each other. ID5 does not match ID1 or ID2 because it's a different ProductType even though it's in the same location.

ID9 does not match IDs 6, 7 or 8.

ID10 and ID11 do not match each other.

How do I identify and number the sets of duplicates so I can colour code them later?

Ideally, rather than have a thousand colours, the ColourGroup number would reset for each Lat/Lng so I can use a set of about ten colours.

CodePudding user response:

the dense_rank window function is handy here. It will numerate groups based on criteria you've passed in the "over()" part. Try this one

with pct_diff as (
select sd.ProductID, 
           sd_1.ProductID duplicate, 
           sd.latitude, 
           sd.longitude,
           sd.producttype,
           round((1.0 * min(sd.price) over(partition by sd.latitude, 
                                    sd.longitude, 
                                    sd.ProductType)) / (1.0 * sd.price) * 100 / 5, 0) pct_diff
      from some_data sd
      join some_data sd_1
        on sd.latitude = sd_1.latitude
       and sd.longitude = sd_1.longitude
       and sd.ProductType = sd_1.ProductType
       and sd.productid <> sd_1.productid
       and sd.price between sd_1.price - sd_1.Price * 0.05 and sd_1.price   sd_1.Price * 0.05)

  select ProductID,
         duplicate,
         Latitude,
         Longitude,
         ProductType,
         DENSE_RANK() over(order by Latitude,
         Longitude,
         ProductType,pct_diff) color_group
    from pct_diff   
  • Related