Spark: Is this wrong way to cache temp view?

Posted: 12-03

I've seen the following code, and I think it is the wrong way to cache a temp view in Spark. What do you think?

spark.sql(
  s"""
     |...
   """.stripMargin).createOrReplaceTempView("temp_view")

spark.table("temp_view").cache()

In my opinion, this code caches the DataFrame that I create with spark.table("temp_view"), but not the original temp view.

Am I right?

CodePudding user response:

IMO yes: you are caching what you read from this table, but if, for example, you read it again on the next line, you will end up with a second scan.
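A minimal sketch of the pattern this implies, assuming an active SparkSession `spark` and a temp view already registered as `temp_view` (the column name `id` is illustrative): cache the result once, materialize it with an action, and reuse the reference instead of re-reading the view.

```scala
import org.apache.spark.sql.functions.col

// Assumes an active SparkSession `spark` and a registered temp view
// named `temp_view`; `id` is a hypothetical column.
val cached = spark.table("temp_view").cache()
cached.count() // eager action so the cache is actually materialized

// Reuse the cached reference rather than calling spark.table again:
cached.filter(col("id") > 0).show()
```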

I think you could try using CACHE TABLE within your SQL:

https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-cache-table.html

CACHE TABLE statement caches contents of a table or output of a query with the given storage level. If a query is cached, then a temp view will be created for this query. This reduces scanning of the original files in future queries.
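A short sketch of that statement, assuming an active SparkSession `spark` and a temp view named `temp_view` (the storage level shown is illustrative):

```scala
// CACHE TABLE is eager by default; CACHE LAZY TABLE would defer
// materialization until the first access.
spark.sql("CACHE TABLE temp_view OPTIONS ('storageLevel' 'MEMORY_ONLY')")

// Release the cached data when it is no longer needed:
spark.sql("UNCACHE TABLE temp_view")
```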

That seems promising to me.

CodePudding user response:

Try to cache it explicitly via the Catalog API:

spark.catalog.cacheTable("temp_view")
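The Catalog API also exposes helpers for checking and releasing the cache; a sketch assuming the same `temp_view` name and an active SparkSession `spark`:

```scala
spark.catalog.cacheTable("temp_view")        // mark the view for caching
println(spark.catalog.isCached("temp_view")) // check whether it is cached
spark.catalog.uncacheTable("temp_view")      // drop the cached data
```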

CodePudding user response:

I think the caching in your example will actually work. Spark does not cache DataFrame instances; instead, it uses the logical plan as the cache key, and the view is transparent for that purpose. For example, here is the code I just tried against a local table I have:

val df = spark.table("mart.dim_region")
df.createOrReplaceTempView("dim_region") // view shares df's logical plan
spark.table("dim_region").cache()        // cache is keyed on that plan

Even though cache is applied to the view, if I repeatedly invoke df.show, the execution plan contains InMemoryTableScan, which is precisely the effect of caching.
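One way to verify this yourself, assuming the setup above, is to run an action and then print the physical plan, looking for the InMemoryTableScan node:

```scala
df.count()   // an action, so the cache is actually materialized
df.explain() // the printed plan should contain InMemoryTableScan
```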
