Home > Enterprise >  Select outliers in table
Select outliers in table

Time:09-16

I have a table with about 100 000 names/rows that look something like this. There are about 3000 different Refnrs. The names are clustered around the Refnr geographically. The problem is that there are some names that have the wrong location. I need to find the rows who dont fit in with the others. I figured I would do this by finding the Latidude OR Longitude that is too far away from the Longitude and Latitude in the rest of the same Refnrs. So if you see the first Refnr they two of them are located at Latitude 10.67xxx, and 1 is located at Latitude 10.34xxx.

So if I say that I want to compare all the names in the different Refnrs and sort out where the 2nd decimal number differs from the rest of the names.

Is there any way to do this so that I dont have to manually run a query 3000 times?

Refnr Latitude Longitude Name
123 10.67643 50.67523 bob
123 10.67143 50.67737 joe
123 10.34133 50.67848 al
234 11.56892 50.12324 berny
234 11.56123 50.12432 bonny
234 11.98135 50.12223 arby
567 10.22892 50.67143 nilly
567 10.22123 50.67236 tilly
567 10.22148 50.22422 billy

I need a select to give me this.

Refnr Latitude Longitude Name
123 10.34133 50.67848 al
234 11.98135 50.12223 arby
567 10.22148 50.22422 billy

Thanks for the help.

CodePudding user response:

This got overly complex, and may not be that useful. Still, it was interesting to work on.

First, set up the test data

DROP TABLE #Test
GO

CREATE TABLE #Test
 (
   Refnr      int           not null
  ,Latitude   decimal(7,5)  not null
  ,Longitude  decimal(7,5)  not null
  ,Name       varchar(100)  not null
)

INSERT #Test VALUES
  (123, 10.67643, 50.67523, 'bob')
 ,(123, 10.67143, 50.67737, 'joe')
 ,(123, 10.34133, 50.67848, 'al')
 ,(234, 11.56892, 50.12324, 'berny')
 ,(234, 11.56123, 50.12432, 'bonny')
 ,(234, 11.98135, 50.12223, 'arby')
 ,(567, 10.22892, 50.67143, 'nilly')
 ,(567, 10.22123, 50.67236, 'tilly')
 ,(567, 10.22148, 50.22422, 'billy')

SELECT *
 from #Test

As requirements are a tad imprecise, use this to round lat, lon to the desired precision. Adjust as necessary.

DECLARE  @Precision  TINYINT  = 1

--SELECT
--   Latitude
--  ,round(Latitude, @Precision)
-- from #Test

Then it gets messy. Problems will up with if there are multiple "outliers", by EITHER latitude OR longitude. I think this will account for all, and remove duplicates, but further review and testing is called for.

;WITH cteGroups as (

--  Set up groups by lat/lon proximity
SELECT
   Refnr
  ,'Latitude'  Type
  ,round(Latitude, @Precision)  Proximity
  ,count(*)  HowMany
 from #Test
 group by
   Refnr
  ,round(Latitude, @Precision)
UNION ALL SELECT
   Refnr
  ,'Longitude'  Type
  ,round(Longitude, @Precision)  Proximity
  ,count(*)  HowMany
 from #Test
 group by
   Refnr
  ,round(Longitude, @Precision)

)
,cteOutliers as (

--  Identify outliers
select
   Type
  ,Refnr
  ,Proximity
  ,row_number() over (partition by Type, Refnr order by HowMany desc)  Ranking
 from cteGroups

)

--  Pull out all items that match with outliers
select te.*
 from cteOutliers  cte
  inner join #Test  te
   on te.Refnr = cte.Refnr
    and (   (cte.Type = 'Latitude' and round(te.Latitude, @Precision) = Proximity)
         or (cte.Type = 'Longitude' and round(te.Longitude, @Precision) = Proximity) )
 where cte.Ranking > 1  --  Not in the larger groups

CodePudding user response:

Here's what is hopefully a working solution - it gives the 3 outliers from your sample data, will be interesting to see if it works on your larger data set.

Create a CTE for each longitude and latitude, count the number of matching values based on first 2 decimal places only and choose the minimum of each group - that's the group's outlier.

Join the results with the main table and filter to only rows matching the outlier lat or long.

with outlierLat as (
    select top (1) with ties refnr,  Round(latitude,2,1) latitude
    from t
    group by refnr, Round(latitude,2,1)
    order by Count(*) 
), outlierLong as (
    select top (1) with ties refnr,  Round(Longitude,2,1) Longitude
    from t
    group by refnr, Round(Longitude,2,1)
    order by Count(*) 
)
select t.* 
from t
left join outlierLat lt on lt.refnr=t.refnr and Round(t.latitude,2,1)=lt.latitude
left join outlierLong lo on lo.refnr=t.refnr and Round(t.Longitude,2,1)=lo.Longitude
where lt.latitude is not null or lo.Longitude is not null

See demo Fiddle

CodePudding user response:

This averages out the center of the locations and looks for ones far from it

SELECT *
,  ABS((SELECT Sum(Latitude) / COUNT(*) FROM #Test) - Latitude) 
   ABS((SELECT Sum(Longitude) / COUNT(*) FROM #Test) - Longitude) as Awayfromhome
from #Test
Order by Awayfromhome desc
  • Related