I have a large coordinate based dataset of cities, where each city has a score and a set of coordinates which indicates the exact point where the score was taken.
There are no duplicate city names in the dataset as I've already cleansed the data by taking only the highest score from each instance of a city string. However there are locations where multiple boroughs of 1 city are included e.g. Brooklyn, Queens etc., and in some worldwide locations like Dubai there are hundreds of these boroughs.
City | Score | latitude | Longitude |
---|---|---|---|
New York | 100 | x | y |
Singapore | 98.5 | x | y |
Toronto | 96.7 | x | y |
... |
I have code which can calculate the difference between two lat/long points in metres that works nicely in SQL. What I need to do is run it through my dataset and find the minimum distance for only the rows above the one being calculated, so if in my table above for Toronto it would find the minimum distance between Toronto & NY, & Toronto & Singapore and then obviously return the Toronto/NY result.
Then I can filter by this mindistance (like WHERE mindistance > 4000) to prevent most of these multiple borough cities from appearing and therefore making my dataset useful.
As I'm only interested in the highest score in each city I only need to calculate this distance from the rows above in each case as the rows below will always have a lower score, hope that makes sense.
I've not been able to try anything along the lines presented above, I've tried searching for how to run something like a for loop in SQL but unsure how to make it only analyse the rows above in the table.
CodePudding user response:
Well, assuming that "above" means smaller id than current, and assuming you want a new column of the nearest city id-
SELECT C.Id,
C.City,
...,
(
SELECT TOP(1) C2.Id
FROM Cities AS C2
WHERE C2.Id < C.Id
ORDER BY ABS(C2.Latitude - C.Latitude) --> put distance formula here
) AS NearestCityId
FROM Cities AS C
ORDER BY C.Id
CodePudding user response:
Sample schema and data:
create table cities
(id int,
city varchar(100),
point geography);
insert into cities values (1, 'New York', 'POINT (-73.4 40.35)');
insert into cities values (2, 'Toronto', 'POINT (-79.34 43.65)');
insert into cities values (3, 'Singapore', 'POINT (103.85 1.29)');
insert into cities values (4, 'Tokyo', 'POINT (139.84 35.65)');
insert into cities values (5, 'Chicago', 'POINT (-87.62 41.88)');
insert into cities values (6, 'Milwaukee', 'POINT (-87.91 43.04)');
Query:
with cte as (
select
c1.city as city1,
c2.city as city2,
c1.point.STDistance(c2.point) as dist
from cities c1
inner join cities c2
on c1.id > c2.id
),
cte2 as (
select
row_number() over (partition by city1 order by dist asc) as row_number,
city1,
city2,
dist
from cte
)
select
city1 as city,
city2 as nearest_previous_city,
dist
from cte2
where row_number = 1
The steps are as follows:
cte
calculates the distance between the given city and all previous cities (c1.id > c2.id
)cte2
orders neighboring cities for each city by the distance, and gives them a rank (partition by city1 order by dist asc
))- for each city with predecessors,
cte3
selects the city in the nearest distance (row_number = 1
)