I have a Databricks table (parquet not delta) "TableA" with a partition column "dldate", and it has ~3000 columns.
When I issue select * from TableA where dldate='2022-01-01'
, the query completes in seconds.
I have a view "view_tableA" which reads from "TableA" and performs some window functions on some of the columns.
When I issue select * from view_tableA where dldate='2022-01-01'
, the query runs forever.
Will the latter query effectively use the partition key of the table? If not, if there is any optimization I can do to make sure partition key is used?
CodePudding user response:
If partitioning of all window functions is aligned with table partitioning, optimizer will be able to push down the predicate to table level and apply partition pruning.
For example:
SELECT *
FROM (SELECT *, sum(a) over (partition by dldate) FROM TableA)
WHERE dldate = '2022-01-01';
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
- Window [dldate#2932, a#2933, sum(a#2933) ...], [dldate#2932]
- Sort [dldate#2932 ASC NULLS FIRST], false, 0
- Exchange hashpartitioning(dldate#2932, 200), ...
- Project [dldate#2932, a#2933]
- FileScan parquet tablea PartitionFilters: [isnotnull(dldate#2932), (dldate#2932 = 2022-01-01)]
Compare this with a query containing window function not partitioned by dldate:
SELECT *
FROM (SELECT *, sum(a) over (partition by a) FROM TableA)
WHERE dldate = '2022-01-01';
AdaptiveSparkPlan isFinalPlan=false
- Filter (isnotnull(dldate#2968) AND (dldate#2968 = 2022-01-01)) << !!!
- Window [dldate#2968, a#2969, sum(a#2969) ...], [a#2969]
- Sort [a#2969 ASC NULLS FIRST], false, 0
- Exchange hashpartitioning(a#2969, 200), ...
- Project [dldate#2968, a#2969]
- FileScan parquet tablea PartitionFilters: [] << !!!