I am trying to optimise some dbt models by using incremental runs on partitioned data and ran into a problem: the suggested approach that I found doesn't seem to work, in the sense that it doesn't decrease the processing load as I'd expect.
Below is the processing load of a simple select on the partitioned table:
[screenshot: unfiltered query processing load]
Now I try to select only new rows added since the last incremental run:
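The filter follows the commonly suggested pattern, roughly this (my_dataset.source_table and my_dataset.target_table are placeholder names):

```sql
-- source_table: the partitioned table being read
-- target_table: the incremental model that already holds earlier rows
SELECT *
FROM my_dataset.source_table
WHERE event_time > (SELECT MAX(event_time) FROM my_dataset.target_table)
```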
You can see that the estimated load is exactly the same.
However, the select inside the WHERE is rather lightweight:
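That is, this query on its own (same placeholder names):

```sql
-- only the event_time column is scanned, so the estimate is small
SELECT MAX(event_time) FROM my_dataset.target_table
```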
And when I fill in the result of that select manually as a literal value, the processing load is suddenly minimal, which is what I'd expect:
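Concretely, something like this, with a hypothetical timestamp literal:

```sql
-- with a constant predicate on the partitioning column, BigQuery can
-- prune partitions at planning time, so the estimate drops accordingly
SELECT *
FROM my_dataset.source_table
WHERE event_time > TIMESTAMP '2022-01-01'  -- hypothetical value
```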
Finally, both tables (the one I am querying data from and the one I take max(event_time) from) are configured in exactly the same way, both partitioned by DAY on the field event_time:
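Roughly, the relevant dbt config on both models looks like this (a sketch; it assumes event_time is a TIMESTAMP column):

```sql
-- sketch of the config shared by both models; adjust data_type
-- if event_time is actually a DATE or DATETIME column
{{ config(
    materialized = 'incremental',
    partition_by = {
      "field": "event_time",
      "data_type": "timestamp",
      "granularity": "day"
    }
) }}
```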
What am I doing wrong? How could I improve my code to actually make it work? I'd expect the processing load to be similar to the one using an explicit condition.
P.S. apologies for posting links instead of images. My reputation is too low, as this is my first question here.
CodePudding user response:
Since the query is dynamic, i.e. the WHERE condition is not a constant, BigQuery cannot accurately estimate the amount of data to be processed before execution.
This is because the value of max(event_time) is not known until runtime and may change, which affects how much data the outer query has to fetch.
For estimation purposes, try one of these two approaches:
- Replace the inner query with a constant value, as you already did manually above, and check the estimated bytes to be processed.
- Try running the query once and check the processed data under Query results -> Job Information -> Bytes processed and Bytes billed; a sketch for reading these values programmatically follows below.
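If you prefer reading those numbers with SQL instead of the console UI, the INFORMATION_SCHEMA jobs views expose them. A minimal sketch; `region-us` is an assumption, substitute the region your jobs actually run in:

```sql
-- bytes processed/billed for your most recent job in this project
SELECT
  job_id,
  total_bytes_processed,
  total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
ORDER BY creation_time DESC
LIMIT 1;
```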