Spark SQL view and partition column usage

Time: 10-25

I have a Databricks table (Parquet, not Delta) "TableA" with a partition column "dldate" and ~3000 columns. When I issue select * from TableA where dldate='2022-01-01', the query completes in seconds.

I have a view "view_tableA" which reads from "TableA" and performs some window functions on some of the columns. When I issue select * from view_tableA where dldate='2022-01-01', the query runs forever.

Will the latter query effectively use the partition key of the table? If not, is there any optimization I can do to make sure the partition key is used?

CodePudding user response:

If the partitioning of all window functions is aligned with the table's partitioning, the optimizer will be able to push the predicate down to the table level and apply partition pruning.

For example:

SELECT *
  FROM (SELECT *, sum(a) over (partition by dldate) FROM TableA)
 WHERE dldate = '2022-01-01';

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
 - Window [dldate#2932, a#2933, sum(a#2933) ...], [dldate#2932]
    - Sort [dldate#2932 ASC NULLS FIRST], false, 0
       - Exchange hashpartitioning(dldate#2932, 200), ...
          - Project [dldate#2932, a#2933]
             - FileScan parquet tablea PartitionFilters: [isnotnull(dldate#2932), (dldate#2932 = 2022-01-01)]

Compare this with a query containing a window function that is not partitioned by dldate. The filter stays above the window operator and the scan reads every partition:

SELECT *
  FROM (SELECT *, sum(a) over (partition by a) FROM TableA)
 WHERE dldate = '2022-01-01';

AdaptiveSparkPlan isFinalPlan=false
 - Filter (isnotnull(dldate#2968) AND (dldate#2968 = 2022-01-01))  << !!!
    - Window [dldate#2968, a#2969, sum(a#2969) ...], [a#2969]
       - Sort [a#2969 ASC NULLS FIRST], false, 0
          - Exchange hashpartitioning(a#2969, 200), ...
             - Project [dldate#2968, a#2969]
                - FileScan parquet tablea PartitionFilters: []  << !!!
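
If the view's window functions cannot be partitioned by dldate, one workaround is to apply the partition filter before the window function runs, so the predicate reaches the FileScan directly. This is a sketch against the hypothetical column names above (a, dldate), not the actual view definition:

SELECT *,
       sum(a) OVER (PARTITION BY a) AS sum_a
  FROM (SELECT *
          FROM TableA
         WHERE dldate = '2022-01-01');  -- filter inside, before the window

Note the semantics differ from filtering after the window: here the window function only sees rows from the selected partition, which is usually what you want when the view is always queried for a single dldate.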