Home > Enterprise >  Google Maps SQL Server : calculating outlier geographic data within group
Google Maps SQL Server : calculating outlier geographic data within group

Time:10-02

There are 100 suppliers, each with between 50 and 1000 items. Each supplier may have items close to their office or spread across an entire country or continent.

As LatLngs are input by a human, some mistakes happen. With lots of data and constant 'churn', mistakes are difficult to identify.

To improve data quality, I want to identify outliers for each supplier so that they can be fixed. If a supplier's items are mostly near New York, one in California would be an outlier.

SUPPLIERS

SupplierID int
Latitude DECIMAL(12,9)
Longitude DECIMAL(12,9)

ITEMS

ItemID int
SupplierID int
LatLng geography

I assume I need to use standard deviation for this, but putting it into T-SQL is giving me a headache.

I'd like to output a list of outliers for each supplier, based on each supplier's specific deviation.

This code outputs Items and the distance between each item and the supplier's office.

WITH cte AS 
(
    SELECT 
        ItemID,
        SupplierID,
        LatLng,
        LatLng.STDistance(GEOGRAPHY::Point(a.Latitude, a.Longitude, 4326))/1000 As Distance
    FROM 
        Items v
    JOIN 
        Suppliers a ON v.SupplierID = a.SupplierID
)
SELECT 
    ItemID, SupplierID, Distance 
FROM cte

Here's the SQL functionality for standard deviation (from a blog post):

DECLARE @StdDev DECIMAL(5,2) 
DECLARE @Avg DECIMAL(5,2)

SELECT 
   @StdDev = STDEV(Qty), 
   @Avg = AVG(Qty) 
FROM Sales

SELECT 
   * 
FROM 
   Sales 
WHERE 
   Qty > @Avg - @StdDev AND
   Qty < @Avg   @StdDev

STEPS I NEED TO DO

  1. Calculate STDEV and AVG for distance, GROUP BY SupplierID
  2. Output items where the distance is greater than AVG STDEV for the item's supplier

This is where I'm scratching my head as this is multiple steps AFTER the multiple steps I've already performed. I guess I could insert what I have into a TEMP table and go from there, but is that really the best way?

CodePudding user response:

You can use window functions for this. Both AVG and STDEV are available as window functions

WITH Distances AS 
(
    SELECT 
        i.ItemID,
        s.SupplierID,
        i.LatLng,
        v.SupplierLocation,
        i.LatLng.STDistance(v.SupplierLocation)/1000 As Distance
    FROM 
        Items i
    JOIN 
        Suppliers s ON i.SupplierID = s.SupplierID
    CROSS APPLY (VALUES (
        GEOGRAPHY::Point(s.Latitude, s.Longitude, 4326)
    )) v(SupplierLocation)
),
Averages AS (
    SELECT
        ItemID,
        SupplierID,
        LatLng,
        SupplierLocation
        Distance,
        AVG(Distance)   OVER (PARTITION BY SupplierID) AS Avg,
        STDEV(Distance) OVER (PARTITION BY SupplierID) AS StDev
    FROM
        Distances
)
SELECT 
    ItemID,
    SupplierID,
    Distance,
    Avg,
    StDev
FROM
    Averages
WHERE 
    Distance > Avg - StdDev AND
    Distance < Avg   StdDev;
  • Related