Databricks Spark SQL query plan optimization to avoid duplicate reads

Time:11-29

I have a Spark SQL query as below:

with xxx as (
select pb_id, pb_name, req_id, level, t_val_id_path
from (
  select pb_id, pb_name, req_id, explode(req_vals) as t_id
  from A
  where dt = '2022-11-20') a
join (
  select t_val_id, level, t_val_id_path
  from B
  where dt = '2022-11-20') b
on a.t_id = b.t_val_id)
select distinct 
  pb_id, req_id, -1 as l1, -1 as l2, -1 as l3
from xxx 
union all
select distinct 
  pb_id, req_id, t_val_id_path[0] as l1, -1 as l2, -1 as l3
from xxx 
union all
select distinct 
  pb_id, req_id, t_val_id_path[0] as l1, t_val_id_path[1] as l2, -1 as l3
from xxx 
where level > 0
union all
select distinct
  pb_id, req_id, t_val_id_path[0] as l1, if(level > 0, t_val_id_path[1], 1) as l2, if(level > 1, t_val_id_path[2], 1) as l3
from xxx;

In Azure Databricks, the SQL query plan is below: [query plan screenshot]

Question: The SQL script only references the Hive tables A and B once each, but in the query plan we can see that A is read 4 times and B is read 4 times. Is it possible to read A and B just once, and then do the filtering and transformation in memory instead of reading them again and again?

CodePudding user response:

Maybe you can try to cache your xxx table. You can use caching in SQL as well, not only in the DataFrame API:

https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-cache-cache-table.html

So IMO it may look like this:

# Use a triple-quoted string so the multiline SQL is valid Python.
# CACHE TABLE ... AS materializes the join once; the later queries
# against xxx then read from the cache instead of rescanning A and B.
spark.sql("""CACHE TABLE xxx AS (
select pb_id, pb_name, req_id, level, t_val_id_path
from (
  select pb_id, pb_name, req_id, explode(req_vals) as t_id
  from A
  where dt = '2022-11-20') a
join (
  select t_val_id, level, t_val_id_path
  from B
  where dt = '2022-11-20') b
on a.t_id = b.t_val_id)""")

spark.sql(
    """select distinct
      pb_id, req_id, -1 as l1, -1 as l2, -1 as l3
    from xxx
    union all
    select distinct
      pb_id, req_id, t_val_id_path[0] as l1, -1 as l2, -1 as l3
    from xxx
    union all
    select distinct
      pb_id, req_id, t_val_id_path[0] as l1, t_val_id_path[1] as l2, -1 as l3
    from xxx
    where level > 0
    union all
    select distinct
      pb_id, req_id, t_val_id_path[0] as l1, if(level > 0, t_val_id_path[1], 1) as l2, if(level > 1, t_val_id_path[2], 1) as l3
    from xxx"""
)
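Two related statements from the cache docs linked above may also be useful here (a sketch using the same tables; note that CACHE TABLE is eager by default, so the join is materialized as soon as the statement runs, and the cache should be released when you are done with it):

```sql
-- Add LAZY to defer materialization until the first query that
-- actually touches xxx, instead of caching eagerly.
CACHE LAZY TABLE xxx AS
select pb_id, pb_name, req_id, level, t_val_id_path
from (select pb_id, pb_name, req_id, explode(req_vals) as t_id
      from A where dt = '2022-11-20') a
join (select t_val_id, level, t_val_id_path
      from B where dt = '2022-11-20') b
on a.t_id = b.t_val_id;

-- After the union query has finished, free the executor memory.
UNCACHE TABLE xxx;
```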