I have geographic products that may have duplicates.
I return a list of possible duplicates and display them on a map for a user to check and delete.
To help the user cross reference between rows, I want to colour code the duplicates. Two or more rows that appear to be duplicates should have the same ColourGroup, so that it is easy to compare the two green rows or the three red rows.
I want to return a unique number for each ColourGroup.
SQL
This returns a list of Products
that may have duplicates based on their Latitude
, Longitude
, ProductType
and Price
/- 5%.
WITH Prods AS
(
SELECT p.ProductID,
p. ProductType,
p.Price,
p.Price ((p.Price/100)*5) As PriceUpper,
p.Price-((p.Price/100)*5) As PriceLower,
Round(p.Latitude,3) As Latitude,
Round(p.Longitude,3) As Longitude
FROM Products p
AND p.Latitude is not null AND p.Longitude is not null
)
SELECT
DISTINCT a.ProductID,
b.ProductID As Duplicate,
a.Latitude,
a.Longitude
FROM Prods a
INNER JOIN Prods b ON a.ProductID <> b.ProductID
AND a.Latitude = b.Latitude
AND a.Longitude = b.Longitude
AND a.ProductType = b.ProductType
AND (b.Price < a.PriceUpper AND b.Price > a.PriceLower)
DESIRED RESULT
Products
ProductID Price Latitude Longitude ProductType
ID1 500 12.34 56.78 Widget
ID2 505 12.34 56.78 Widget
ID3 200 12.34 56.78 Widget
ID4 800 12.34 56.78 Widget
ID5 500 12.34 56.78 Doodad
ID6 300 98.76 54.32 Doodad
ID7 295 98.76 54.32 Doodad
ID8 302 98.76 54.32 Doodad
ID9 100 98.76 54.32 Doodad
ID10 250 12.34 56.78 Thingamy
ID11 600 12.34 56.78 Thingamy
I want to return the following:
ProductID Duplicate Latitude Longitude ColourGroup
ID1 ID2 12.34 56.78 1
ID2 ID1 12.34 56.78 1
ID6 ID7 98.76 54.32 2
ID6 ID8 98.76 54.32 2
ID7 ID6 98.76 54.32 2
ID7 ID8 98.76 54.32 2
ID8 ID6 98.76 54.32 2
ID8 ID7 98.76 54.32 2
ID3 and ID4 do not match ID1 or ID2 because they are outside the /- 5%, and do not match each other. ID5 does not match ID1 or ID2 because it's a different ProductType even though it's in the same location.
ID9 does not match IDs 6, 7 or 8.
ID10 and ID11 do not match each other.
How do I identify and number the sets of duplicates so I can colour code them later?
Ideally, rather than have a thousand colours, the ColourGroup number would reset for each Lat/Lng so I can use a set of about ten colours.
CodePudding user response:
the dense_rank window function is handy here. It will numerate groups based on criteria you've passed in the "over()" part. Try this one
with pct_diff as (
select sd.ProductID,
sd_1.ProductID duplicate,
sd.latitude,
sd.longitude,
sd.producttype,
round((1.0 * min(sd.price) over(partition by sd.latitude,
sd.longitude,
sd.ProductType)) / (1.0 * sd.price) * 100 / 5, 0) pct_diff
from some_data sd
join some_data sd_1
on sd.latitude = sd_1.latitude
and sd.longitude = sd_1.longitude
and sd.ProductType = sd_1.ProductType
and sd.productid <> sd_1.productid
and sd.price between sd_1.price - sd_1.Price * 0.05 and sd_1.price sd_1.Price * 0.05)
select ProductID,
duplicate,
Latitude,
Longitude,
ProductType,
DENSE_RANK() over(order by Latitude,
Longitude,
ProductType,pct_diff) color_group
from pct_diff