What to use to store intermediate data in Spark?


What's the difference between storing intermediate tables in DataFrames versus TempViews? Is there a difference in memory usage?

CodePudding user response:

You can think of a TempView as a temporary Hive table that exists as long as the underlying Spark session is not closed.

So if you have a DataFrame df and run df.createOrReplaceTempView("something"), you can retrieve that DataFrame with val df = spark.table("something") anywhere in the project (within the same Spark session), as long as createOrReplaceTempView was called first.
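As a minimal sketch of that flow (assuming an existing SparkSession and a hypothetical file people.parquet with an age column):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TempViewDemo").getOrCreate()

// Build a DataFrame and register it under a name for the current session.
val peopleDf = spark.read.parquet("people.parquet")   // hypothetical source
peopleDf.createOrReplaceTempView("something")

// Elsewhere in the same Spark session, the same data can be retrieved
// by name, either as a DataFrame or through SQL.
val df     = spark.table("something")
val adults = spark.sql("SELECT * FROM something WHERE age >= 18")
```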

Additional info here: How does createOrReplaceTempView work in Spark?

CodePudding user response:

DataFrames themselves are intermediate 'tables' that can be cached to memory and/or disk. I leave aside the code generation done by Catalyst.

From the manuals on temp views:

Running SQL Queries Programmatically

  • The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.

  • In order to do this you register the DataFrame as a SQL temporary view. This is a "lazy" artefact and there must already be a DataFrame / Dataset present; it just needs registering to enable the SQL interface (see the sketch after this list).

    • Caching of the underlying DataFrame(s) helps with repeated access.
    • The temp view is an in-memory reference to a DataFrame, generally with no overhead of its own.
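To illustrate the two points above, here is a hedged sketch that caches a hypothetical sales DataFrame and then queries it repeatedly through a temp view (the file name and columns are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TempViewCaching").getOrCreate()

// Hypothetical source; the temp view is just a name pointing at this DataFrame.
val salesDf = spark.read.parquet("sales.parquet")

// Caching the underlying DataFrame is what helps repeated access;
// registering the view adds essentially no overhead of its own.
salesDf.cache()
salesDf.createOrReplaceTempView("sales")

// Both queries below reuse the cached data.
val byRegion  = spark.sql("SELECT region, sum(amount) AS total FROM sales GROUP BY region")
val topOrders = spark.sql("SELECT * FROM sales ORDER BY amount DESC LIMIT 10")
```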

So in summary: within a Spark app, the DataFrame is a temporary data store / intermediate table (you could argue it is the latter). If you need complex SQL against a DataFrame that the DataFrame API cannot handle cleanly, you use the temp view. That is different from Spark SQL against a real Hive table or a JDBC-read table, but the interface is the same.

This btw is a good reference: https://medium.com/@kar9475/data-sharing-between-multiple-spark-jobs-in-databricks-308687c99897
