How do you explain the difference in performance of these two queries?-CodePudding

I have the following two Queries:

WITH
  T AS(
  SELECT
    ROW_NUMBER() OVER() AS uid,
    position_geom
  FROM
    `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
  WHERE
    date > "2017-12-26" )
SELECT
  T1.uid,
  T2.uid,
  ST_DISTANCE(T1.position_geom, T2.position_geom) distance
FROM
  T AS T1,
  T AS T2
WHERE
  T1.uid <> T2.uid QUALIFY ROW_NUMBER() OVER (PARTITION BY T1.uid ORDER BY ST_DISTANCE(T1.position_geom, T2.position_geom)) < 101
ORDER BY
  T1.uid,
  ST_DISTANCE(T1.position_geom, T2.position_geom)

AND

WITH
  T AS(
  SELECT
    ROW_NUMBER() OVER() AS uid,
    position_geom
  FROM
    `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
  WHERE
    date > "2017-12-26" ),
  T2 AS(
  SELECT
    T1.position_geom pointA,
    T1.uid uidA,
    T2.position_geom pointB,
    T2.uid uidB,
    ST_DISTANCE(T1.position_geom, T2.position_geom) distance
  FROM
    T AS T1,
    T AS T2 )
SELECT
  uidA,
  uidB,
  distance
FROM
  T2
WHERE
  uidA <> uidB QUALIFY ROW_NUMBER() OVER (PARTITION BY uidA ORDER BY distance) < 101
ORDER BY
  uidA,
  distance

Both queries give the same result, but the second goes almost twice as fast (in term of elapsed time and slot time consumed). Why is the first query not internally optimized by bigquery?

CodePudding user response：

You can see in the execution details how the queries stages are processed. The first query reads all the data that you are selecting after you join the t1 table and the t2 table, and then is fast to sort because the data is already in the temporary table, so this makes the whole sorting process faster, but when the data is being read, it makes the query execution slower.

The second query is faster because it reads the data when you are in the sorting process. This means that the sorting process is slower than the first query, but you don’t have a repartition process to read the data, and this makes the query execution faster.

Here, the difference is how the query is written because, in the second one, you are given a subquery to distribute how the query will work, and when you read it, it is faster because there is only one reading process, which makes this consume less slot time.

CodePudding user response：

In order to give some hints, you need to precise a few elements :

Which database are you using
Which index(es) is(are) defined
The number of rows in bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017
Are you querying the database directly or do you go through a connector ?

Depending on the number of rows, your database engine may not use the index. Each database has - normally - a set of tools that helps to analyse queries.

mysql (should work for MariaDB too) for exemple provides the keyword EXPLAIN and ANALIZE TABLE If you have the Enterprise Edition, you can get access to Mysql query analysis

SQLServer provide SQL server query analyzer

...