How to improve performance of ST_INTERSECT in BigQuery?-CodePudding

I have two tables, one with 166M address points and one with 170,000 precinct polygons. I want to intersect the points through the polygons to get a table containing the address id and the precinct id for the precinct the address is found in. To avoid issues with precincts that overlap state boundaries (because this is a merged dataset of individual state datasets), I have added a condition that the state in each file be the same.

To improve performance, both input tables have been clustered by their respective geometry fields, an inner join is used, and the input tables are materialized as tables (as per the documentation on using spatial joins). However, the process times out before it finishes.

To troubleshoot, I ensured that the precinct polygons have the expected orientation (which was the issue in this post), and I have also reviewed the BigQuery documentation on optimizing queries

The first thing I tried was a standard inner join:

SELECT
    addr.id_address,
    prec.id_precinct,
    prec.geom  
FROM precincts AS prec
INNER JOIN addresses AS addr  
    ON prec.geo_state = addr.geo_state
        AND ST_INTERSECTS(addr.geom, prec.geom)

I have also tried using a window function:

SELECT *
FROM (
    SELECT
        addr.id_address,
        prec.id_precinct,
        prec.geom,
        ROW_NUMBER() OVER(PARTITION BY addr.geo_state) AS rn
    FROM addresses AS addr
    INNER JOIN precincts AS prec
        ON prec.geo_state = addr.geo_state
            AND ST_INTERSECTS(addr.geom, prec.geom)
) AS ranked
WHERE rn = 1

In both cases, it times out before finishing. Any help on how to optimize this query is much appreciated. Note that I am using dbt to run this, and I have my timeout set to 1800 sec (it times out when running it in the BigQuery UI, though, too).

CodePudding user response：

TL;DR: Does the query runs fast enough is you remove the state equality condition prec.geo_state = addr.geo_state? If yes, the solution is to remove it and filter separately as separate query, or turn it into something where both sides of = depend on left and right join children, e.g.

CONCAT(prec.geo_state, ".",  addr.geo_state) = CONCAT(addr.geo_state, ".", prec.geo_state)

Explanation: BigQuery does not yet support efficient mixing of different types of JOIN conditions. If it sees both an equality condition, and a geospatial condition in a single join, it "prefers" equality join, executing geospatial predicate as a regular filter condition, instead of building and using geospatial indices.

And that's why this trick often improves performance: it "hides" equality condition, makes BigQuery use geospatial join, and only then check the now-complex equality condition. Of course, it only works if the query without state equality predicate runs reasonably fast.