Databricks Spark SQL query plan optimization to avoid duplicate reads

Time:11-29

I have a Spark SQL query as below:

with xxx as (
select pb_id, pb_name, req_id, level, t_val_id_path
from (
  select pb_id, pb_name, req_id, explode(req_vals) as t_id
  from A
  where dt = '2022-11-20') a
join (
  select t_val_id, level, t_val_id_path
  from B
  where dt = '2022-11-20') b
on a.t_id = b.t_val_id)
select distinct 
  pb_id, req_id, -1 as l1, -1 as l2, -1 as l3
from xxx 
union all
select distinct 
  pb_id, req_id, t_val_id_path[0] as l1, -1 as l2, -1 as l3
from xxx 
union all
select distinct 
  pb_id, req_id, t_val_id_path[0] as l1, t_val_id_path[1] as l2, -1 as l3
from xxx 
where level > 0
union all
select distinct
  pb_id, req_id, t_val_id_path[0] as l1, if(level > 0, t_val_id_path[1], 1) as l2, if(level > 1, t_val_id_path[2], 1) as l3
from xxx;

In Azure Databricks, the SQL query plan is below: [query plan screenshot]

Question: The SQL script only references the Hive tables A and B once each, but in the query plan we can see that A is read 4 times and B is read 4 times. Is it possible to read A and B just once, and then do the filtering and transformation in memory instead of reading them again and again?

CodePudding user response:

Maybe you can try to cache your xxx table. You can use caching in SQL as well, not only in the DataFrame API:

https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-cache-cache-table.html

So IMO it may look like this:

# Use a triple-quoted string so the multiline SQL is valid Python.
# CACHE TABLE ... AS materializes the join once; the later queries
# against xxx then read from the cache instead of rescanning A and B.
spark.sql("""CACHE TABLE xxx AS (
select pb_id, pb_name, req_id, level, t_val_id_path
from (
  select pb_id, pb_name, req_id, explode(req_vals) as t_id
  from A
  where dt = '2022-11-20') a
join (
  select t_val_id, level, t_val_id_path
  from B
  where dt = '2022-11-20') b
on a.t_id = b.t_val_id)""")

spark.sql(
    """select distinct
      pb_id, req_id, -1 as l1, -1 as l2, -1 as l3
    from xxx
    union all
    select distinct
      pb_id, req_id, t_val_id_path[0] as l1, -1 as l2, -1 as l3
    from xxx
    union all
    select distinct
      pb_id, req_id, t_val_id_path[0] as l1, t_val_id_path[1] as l2, -1 as l3
    from xxx
    where level > 0
    union all
    select distinct
      pb_id, req_id, t_val_id_path[0] as l1, if(level > 0, t_val_id_path[1], 1) as l2, if(level > 1, t_val_id_path[2], 1) as l3
    from xxx"""
)
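Two related statements from the cache docs linked above may also be useful here (a sketch using the same tables; note that CACHE TABLE is eager by default, so the join is materialized as soon as the statement runs, and the cache should be released when you are done with it):

```sql
-- Add LAZY to defer materialization until the first query that
-- actually touches xxx, instead of caching eagerly.
CACHE LAZY TABLE xxx AS
select pb_id, pb_name, req_id, level, t_val_id_path
from (select pb_id, pb_name, req_id, explode(req_vals) as t_id
      from A where dt = '2022-11-20') a
join (select t_val_id, level, t_val_id_path
      from B where dt = '2022-11-20') b
on a.t_id = b.t_val_id;

-- After the union query has finished, free the executor memory.
UNCACHE TABLE xxx;
```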