What's the difference between storing intermediate tables in DataFrames or temp views? Is there a difference in memory usage?
CodePudding user response:
You can think of a temp view as a temporary Hive table that exists as long as the underlying Spark session is not closed.
So if you have a DataFrame df and run df.createOrReplaceTempView("something"), you can retrieve it later with val df = spark.table("something") anywhere in the project (within the same SparkSession), as long as createOrReplaceTempView was called first.
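A minimal sketch of that round trip (the view name is illustrative, and a spark-shell style session with spark already in scope is assumed):

```scala
// Build a small DataFrame and register it under a name
val df = spark.range(5).toDF("id")
df.createOrReplaceTempView("something")

// Elsewhere in the same SparkSession, get it back by name
val df2 = spark.table("something")
df2.show()
```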
Additional info here: How does createOrReplaceTempView work in Spark?
CodePudding user response:
DataFrames themselves are intermediate 'tables' that can be cached to memory and/or disk. (I leave aside the code generated via Catalyst.)
From the manuals on temp views:
Running SQL Queries Programmatically
The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.
To do this you register the DataFrame as a SQL temporary view. This is a "lazy" artefact: there must already be a DataFrame / Dataset present; it just needs registering to expose it to the SQL interface.
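As a sketch of that flow (the view and column names here are made up):

```scala
import spark.implicits._

// An existing DataFrame, registered as a lazy SQL temporary view
val people = Seq(("Alice", 30), ("Bob", 12)).toDF("name", "age")
people.createOrReplaceTempView("people")

// spark.sql returns the result as a DataFrame; nothing is materialized
// until an action such as show() runs
val adults = spark.sql("SELECT name FROM people WHERE age >= 18")
adults.show()
```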
Caching of the underlying DataFrame(s) helps with repeated access. The temp view itself is just a reference to a DataFrame, with generally no overhead of its own.
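For example, one way to set that up (a sketch; the view name and size are arbitrary):

```scala
// Mark the underlying DataFrame for caching, then register the view
val events = spark.range(1000000L).toDF("id")
events.cache()                                   // lazy: nothing cached yet
events.createOrReplaceTempView("events")

spark.sql("SELECT COUNT(*) FROM events").show()  // first action populates the cache
spark.sql("SELECT MAX(id) FROM events").show()   // reuses the cached data
```

Equivalently, spark.catalog.cacheTable("events") caches by view name.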
So in summary: within a Spark app, the DataFrame is your temporary data store / intermediate table. If you need complex SQL against a DataFrame that the DataFrame API cannot handle comfortably, you register a temp view and use the SQL interface. That is different from Spark SQL against a real Hive table or a JDBC-read table, but the interface is the same.
This btw is a good reference: https://medium.com/@kar9475/data-sharing-between-multiple-spark-jobs-in-databricks-308687c99897