Home > database >  Finding shortest geo-spatial distance from one point to all other points in SQL
Finding shortest geo-spatial distance from one point to all other points in SQL

Time:12-05

There are two types of users who purchase the movie tickets from either town A, town B, town C or online.

I have the following tables as:

Locations: This table consists of locations of movie centers

|--------------|------------|------------|
|    Towns     |  latitude  | longitude  |
|--------------|------------|------------|
|  Town_A      |  72.92629  | -12.89272  |
|  Town_B      |  93.62789  | -83.10172  |
|  Town_C      |  68.92612  | -67.17242  |
|--------------|------------|------------|

Users: This table contains the history of user's purchase i.e. either online or in towns. Also consists of user's latitude/longitude during the purchase.

|------------|------------|------------|--------------|
|   user_id  |  latitude  | longitude  |    Towns     |
|------------|------------|------------|--------------|
|    1       |  21.89027  | -53.03772  |   Town_A     |
|    1       |  23.87847  | -41.78172  |   Town_C     |
|    1       |  39.62847  | -80.19892  |   online     |
|    1       |  77.87092  | -96.39242  |   Town_A     |
|    2       |  71.87782  | -38.03782  |   online     |
|    2       |  83.37847  | -62.78278  |   Town_B     |
|    3       |  89.81924  | -80.73892  |   Town_B     |
|    3       |  27.87282  | -18.39183  |   Town_A     |
|------------|------------|------------|--------------|

I want to find the nearest town based on user's lat/long during his purchase. The final table would look like below:

|------------|------------|------------|--------------|-----------------|
|   user_id  |  latitude  | longitude  |    Towns     | nearest_town    |
|------------|------------|------------|--------------|-----------------|
|    1       |  21.89027  | -53.03772  |   Town_A     |   Town_B        | <--- Town_B is near based on his lat/long (Irrespective of his purchase town)
|    1       |  23.87847  | -41.78172  |   Town_C     |   Town_A        | <--- Town_A is near based on his lat/long
|    1       |  39.62847  | -80.19892  |   online     |   Town_Online   |
|    1       |  77.87092  | -96.39242  |   Town_A     |   Town_A        |
|    2       |  71.87782  | -38.03782  |   online     |   Town_Online   |
|    2       |  83.37847  | -62.78278  |   Town_B     |   Town_C        |
|    3       |  89.81924  | -80.73892  |   Town_B     |   Town_A        |
|    3       |  27.87282  | -18.39183  |   Town_A     |   Town_A        |
|------------|------------|------------|--------------|-----------------|

SQL Query (Snowflake) my attempt:

With specific_location as
(
  select user_id,
         latitude,
     longitude,
     case when Towns in ('Town_A','Town_B','Town_C') then 'Town' else 'Town_Online' end as purchase_in
  from Locations
)
 select *, 
       case when purchase_in = 'Town' then
            (select Towns from Location qualify row_number() over (order by haversine(user.latitude,user.longitude,location.latitude,location.longitude))=1)
            else purchase_in
       end as nearest_town
 from specific_location

I'm getting an error: syntax error unexpected 'when' and unexpected 'else'

CodePudding user response:

You CTE specific_location is missing a JOIN to USERS as locations itself does not have a user_id column.

I would also make an enriched user, to add a sequence just so later the location match can be distinctly per user row, and then do the user/location join in a second CTE, and thus the select you do at the end is with pre-computed values:

I also swapped you two value CASE statements for IFF's

WITH enriched_user AS (
    SLECT 
        u.user_id,
        u.latitude,
        u.longitude,
        u.town,
        seq4() as seq,
        IFF(towns IN ('Town_A','Town_B','Town_C'), 'Town', 'Town_Online') AS purchase_in
    FROM user AS u
), user_and_closest_location AS (
    SELECT 
        u.user_id,
        u.latitude,
        u.longitude,
        u.town,
        u.purchase_in
        l.town as closest_town
        haversine(u.latitude, u.longitude, l.latitude, l.longitude)
    FROM enriched_user AS u,
        location AS l
    QUALIFY row_number() OVER (PARTION BY u.seq ORDER BY haversine(u.latitude, u.longitude, l.latitude, l.longitude)) = 1
)
SELECT      
    u.user_id,
    u.latitude,
    u.longitude,
    u.town,
    IFF(u.purchase_in = 'Town', u.closest_town, u.purchase_in) AS nearest_town
FROM user_and_closest_location AS u
ORDER BY 1,2,3; 

The logic all calculating the distance based join for all row, is that it will be faster, and if there are things you want to not do it for, it would be better to prune the input there, but then you will need to rejoin to input to captured the skipped values.

WITH enriched_user AS (
    SLECT 
        u.user_id,
        u.latitude,
        u.longitude,
        u.town,
        seq4() as seq,
        IFF(towns IN ('Town_A','Town_B','Town_C'), 'Town', 'Town_Online') AS purchase_in
    FROM user AS u
), user_and_closest_location AS (
    SELECT 
        u.user_id,
        u.latitude,
        u.longitude,
        u.town,
        u.purchase_in
        l.town as closest_town
        haversine(u.latitude, u.longitude, l.latitude, l.longitude)
    FROM enriched_user AS u,
        location AS l
    WHERE u.purchase_in = 'Town'
    QUALIFY row_number() OVER (PARTION BY u.seq ORDER BY haversine(u.latitude, u.longitude, l.latitude, l.longitude)) = 1
)
SELECT      
    u.user_id
    u.latitude,
    u.longitude,
    u.town,
    IFF(u.purchase_in = 'Town', ucl.closest_town, u.purchase_in) AS nearest_town
FROM enriched_user user_and_closest_location AS u
LEFT JOIN user_and_closest_location AS ucl 
    ON u.seq = ucl.seq
ORDER BY 1,2,3;

also the in towns could be flipped to be not 'online`

IFF(towns IN ('Town_A','Town_B','Town_C'), 'Town', 'Town_Online') AS purchase_in

becoming:

IFF(towns != 'online', 'Town', 'Town_Online')

at which point the actual test can be moved to where it's used later.

  • Related