Identify stores with same location in Bigquery-CodePudding

So, let's say I have this data:

store_name               | latitude    | longitude | store_id | same_location_count | same_location_store_id
SR Restaurant and Cafe   | -41.575449  | 147.16824 | 1112     | 0           | null
Big Bite Burgers         | -41.575449  | 147.16824 | 1113     | 0           | null
Amigos                   | -41.575449  | 147.16824 | 1114     | 0           | null
Domino's                 | -38.33983   | 143.58384 | 1115     | 0           | null

I want to identity the stores that are in same coordinates.

The expected result looks like this:

store_name               | latitude    | longitude | store_id | same_location_count | same_location_store_id
SR Restaurant and Cafe   | -41.575449  | 147.16824 | 1112     | 2           | 1113:1114
Big Bite Burgers         | -41.575449  | 147.16824 | 1113     | 2           | 1112:1114
Amigos                   | -41.575449  | 147.16824 | 1114     | 2           | 1112:1113
Domino's                 | -38.33983   | 143.58384 | 1115     | 0           | null

If the stores are in same coordinates, we increment the same_location_count and get the ID of the same store location and concatenate it to same_location_store_id.

CodePudding user response：

You can use count and string_agg function for achieving this.

Test code:

with test as (
  select 'id1' id, '-41.575449' lat, '1112' long
  union all
  select 'id2' id, '-41.575449' lat, '1112' long
  union all
  select 'id3' id, '-41.575449' lat, '1112' long
  union all
  select 'id4' id, '-38.33983' lat, '1115' long
)
SELECT
  *,
  (COUNT(id) OVER (PARTITION BY lat, long))-1 AS same_location_count,
  REPLACE( STRING_AGG(CONCAT(id,':'),'') OVER (PARTITION BY lat, long), CONCAT(id,':'), '') AS same_location_store_id
FROM
  test

Edit: match the exact scenario without counting the id of each row and excluding it from the concatenation.

CodePudding user response：

Maybe something similar to the following. This works for Postgres DB, I think it should work in BigQuery also?

SELECT
  stores.*,
  COUNT(store_id) OVER (PARTITION BY latitude,longitude) - 1 AS same_location_count,
  NULLIF(
    TRIM(REPLACE(
      CONCAT(':', STRING_AGG(CAST(store_id AS TEXT), ':') OVER (PARTITION BY latitude,longitude), ':'),
      CONCAT(':', CAST(store_id AS TEXT), ':'),
      ':'
    ), ':'),
    ''
  ) AS same_location_store_id
FROM stores

Fiddle link: https://www.db-fiddle.com/f/gUfJm7X75Fdx1nuy9hSrsD/1

Edit: Updated from comment.

Count is the easy part; just subtract 1 from value. Same location stores you can see has become a horrible mess...

Basically the logic is:

Collect all store_ids for same location (now using string_agg instead of array_agg
Make a string of :idA:idB:...:idN: with leading,trailing :
Remove the store_id of same store by replace ":id:" with just ":"
Tidy up afterwards to make sure null for empty string, and remove leading,trailing : again

Step 3 works only because store_id is integer, and so guaranteed not to contain separator char :