I am trying to optimise some dbt models by using incremental runs on partitioned data and ran into a problem: the suggested approach that I found doesn't seem to work, in the sense that it doesn't decrease the processing load as I'd expect.
Below is the processing load of a simple select on the partitioned table:
[screenshot: unfiltered query processing load]
Now I try to select only new rows added since the last incremental run:
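The filter follows the commonly suggested pattern, roughly this (my_dataset.source_table and my_dataset.target_table are placeholder names):

```sql
-- source_table: the partitioned table being read
-- target_table: the incremental model that already holds earlier rows
SELECT *
FROM my_dataset.source_table
WHERE event_time > (SELECT MAX(event_time) FROM my_dataset.target_table)
```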
You can see that the estimated load is exactly the same.
However, the select inside the WHERE is rather lightweight:
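That is, this query on its own (same placeholder names):

```sql
-- only the event_time column is scanned, so the estimate is small
SELECT MAX(event_time) FROM my_dataset.target_table
```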
And when I fill in the result of that select manually as a literal value, the processing load is suddenly minimal, which is what I'd expect:
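Concretely, something like this, with a hypothetical timestamp literal:

```sql
-- with a constant predicate on the partitioning column, BigQuery can
-- prune partitions at planning time, so the estimate drops accordingly
SELECT *
FROM my_dataset.source_table
WHERE event_time > TIMESTAMP '2022-01-01'  -- hypothetical value
```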
Finally, both tables (the one I am querying data from and the one I take max(event_time) from) are configured in exactly the same way, both partitioned by DAY on the field event_time:
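Roughly, the relevant dbt config on both models looks like this (a sketch; it assumes event_time is a TIMESTAMP column):

```sql
-- sketch of the config shared by both models; adjust data_type
-- if event_time is actually a DATE or DATETIME column
{{ config(
    materialized = 'incremental',
    partition_by = {
      "field": "event_time",
      "data_type": "timestamp",
      "granularity": "day"
    }
) }}
```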
What am I doing wrong? How could I improve my code to actually make it work? I'd expect the processing load to be similar to the one using an explicit condition.
P.S. apologies for posting links instead of images. My reputation is too low, as this is my first question here.
CodePudding user response:
Since the query is dynamic, i.e. the WHERE condition is not a constant, BigQuery cannot accurately estimate the amount of data to be processed before execution.
This is because the value of max(event_time) is not known until runtime and may change, which affects how much data the outer query has to fetch.
For estimation purposes, try one of these two approaches:
- Replace the inner query with a constant value, as you already did manually above, and check the estimated bytes to be processed.
- Try running the query once and check the processed data under Query results -> Job Information -> Bytes processed and Bytes billed; a sketch for reading these values programmatically follows below.
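If you prefer reading those numbers with SQL instead of the console UI, the INFORMATION_SCHEMA jobs views expose them. A minimal sketch; `region-us` is an assumption, substitute the region your jobs actually run in:

```sql
-- bytes processed/billed for your most recent job in this project
SELECT
  job_id,
  total_bytes_processed,
  total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
ORDER BY creation_time DESC
LIMIT 1;
```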