I have a table with about 100 000 names/rows that look something like this. There are about 3000 different Refnrs. The names are clustered around the Refnr geographically. The problem is that there are some names that have the wrong location. I need to find the rows who dont fit in with the others. I figured I would do this by finding the Latidude OR Longitude that is too far away from the Longitude and Latitude in the rest of the same Refnrs. So if you see the first Refnr they two of them are located at Latitude 10.67xxx, and 1 is located at Latitude 10.34xxx.
So if I say that I want to compare all the names in the different Refnrs and sort out where the 2nd decimal number differs from the rest of the names.
Is there any way to do this so that I dont have to manually run a query 3000 times?
Refnr | Latitude | Longitude | Name |
---|---|---|---|
123 | 10.67643 | 50.67523 | bob |
123 | 10.67143 | 50.67737 | joe |
123 | 10.34133 | 50.67848 | al |
234 | 11.56892 | 50.12324 | berny |
234 | 11.56123 | 50.12432 | bonny |
234 | 11.98135 | 50.12223 | arby |
567 | 10.22892 | 50.67143 | nilly |
567 | 10.22123 | 50.67236 | tilly |
567 | 10.22148 | 50.22422 | billy |
I need a select to give me this.
Refnr | Latitude | Longitude | Name |
---|---|---|---|
123 | 10.34133 | 50.67848 | al |
234 | 11.98135 | 50.12223 | arby |
567 | 10.22148 | 50.22422 | billy |
Thanks for the help.
CodePudding user response:
This got overly complex, and may not be that useful. Still, it was interesting to work on.
First, set up the test data
DROP TABLE #Test
GO
CREATE TABLE #Test
(
Refnr int not null
,Latitude decimal(7,5) not null
,Longitude decimal(7,5) not null
,Name varchar(100) not null
)
INSERT #Test VALUES
(123, 10.67643, 50.67523, 'bob')
,(123, 10.67143, 50.67737, 'joe')
,(123, 10.34133, 50.67848, 'al')
,(234, 11.56892, 50.12324, 'berny')
,(234, 11.56123, 50.12432, 'bonny')
,(234, 11.98135, 50.12223, 'arby')
,(567, 10.22892, 50.67143, 'nilly')
,(567, 10.22123, 50.67236, 'tilly')
,(567, 10.22148, 50.22422, 'billy')
SELECT *
from #Test
As requirements are a tad imprecise, use this to round lat, lon to the desired precision. Adjust as necessary.
DECLARE @Precision TINYINT = 1
--SELECT
-- Latitude
-- ,round(Latitude, @Precision)
-- from #Test
Then it gets messy. Problems will up with if there are multiple "outliers", by EITHER latitude OR longitude. I think this will account for all, and remove duplicates, but further review and testing is called for.
;WITH cteGroups as (
-- Set up groups by lat/lon proximity
SELECT
Refnr
,'Latitude' Type
,round(Latitude, @Precision) Proximity
,count(*) HowMany
from #Test
group by
Refnr
,round(Latitude, @Precision)
UNION ALL SELECT
Refnr
,'Longitude' Type
,round(Longitude, @Precision) Proximity
,count(*) HowMany
from #Test
group by
Refnr
,round(Longitude, @Precision)
)
,cteOutliers as (
-- Identify outliers
select
Type
,Refnr
,Proximity
,row_number() over (partition by Type, Refnr order by HowMany desc) Ranking
from cteGroups
)
-- Pull out all items that match with outliers
select te.*
from cteOutliers cte
inner join #Test te
on te.Refnr = cte.Refnr
and ( (cte.Type = 'Latitude' and round(te.Latitude, @Precision) = Proximity)
or (cte.Type = 'Longitude' and round(te.Longitude, @Precision) = Proximity) )
where cte.Ranking > 1 -- Not in the larger groups
CodePudding user response:
Here's what is hopefully a working solution - it gives the 3 outliers from your sample data, will be interesting to see if it works on your larger data set.
Create a CTE for each longitude and latitude, count the number of matching values based on first 2 decimal places only and choose the minimum of each group - that's the group's outlier.
Join the results with the main table and filter to only rows matching the outlier lat or long.
with outlierLat as (
select top (1) with ties refnr, Round(latitude,2,1) latitude
from t
group by refnr, Round(latitude,2,1)
order by Count(*)
), outlierLong as (
select top (1) with ties refnr, Round(Longitude,2,1) Longitude
from t
group by refnr, Round(Longitude,2,1)
order by Count(*)
)
select t.*
from t
left join outlierLat lt on lt.refnr=t.refnr and Round(t.latitude,2,1)=lt.latitude
left join outlierLong lo on lo.refnr=t.refnr and Round(t.Longitude,2,1)=lo.Longitude
where lt.latitude is not null or lo.Longitude is not null
CodePudding user response:
This averages out the center of the locations and looks for ones far from it
SELECT *
, ABS((SELECT Sum(Latitude) / COUNT(*) FROM #Test) - Latitude)
ABS((SELECT Sum(Longitude) / COUNT(*) FROM #Test) - Longitude) as Awayfromhome
from #Test
Order by Awayfromhome desc