What to use to store intermediate data in Spark?


What's the difference between storing intermediate tables in DataFrames versus TempViews? Is there a difference in memory usage?

CodePudding user response:

You can think of a TempView as a temporary Hive table that exists as long as the underlying Spark session is not closed.

So if you have a DataFrame df and run df.createOrReplaceTempView("something"), you can retrieve it with val df = spark.table("something") anywhere in the project (within the same Spark session), as long as createOrReplaceTempView was called first.
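As a minimal sketch of that round trip (the view name "something" and the column names are just illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("temp-view-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Build a small DataFrame and register it under a name
    val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
    df.createOrReplaceTempView("something")

    // Anywhere else within the same SparkSession, get it back by name
    val sameDf = spark.table("something")
    sameDf.show()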

Additional info here: How does createOrReplaceTempView work in Spark?

CodePudding user response:

DataFrames themselves are intermediate 'tables' that can be cached to memory and/or disk. I leave aside the code generated via Catalyst.
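For example (a hedged sketch; intermediateDf is a hypothetical DataFrame name), caching an intermediate DataFrame to memory and/or disk looks like this:

    import org.apache.spark.storage.StorageLevel

    // Assuming intermediateDf is an existing DataFrame (hypothetical name).
    intermediateDf.persist(StorageLevel.MEMORY_AND_DISK)  // keep in memory, spill to disk if it does not fit
    intermediateDf.count()                                // an action materialises the cache
    // ... reuse intermediateDf across several computations ...
    intermediateDf.unpersist()                            // release it when no longer needed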

From the manuals on temp views:

Running SQL Queries Programmatically

  • The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.

  • In order to do this, you register the DataFrame as a SQL temporary view. This is a "lazy" artefact: there must already be a DataFrame / Dataset present, and it just needs registering to enable the SQL interface (see the sketch after this list).

    • Caching of the underlying DataFrame(s) helps with repeated access.
    • The temp view is a reference in memory to a DataFrame, with generally no overhead.
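Putting those two points together, a minimal sketch (the ordersDf DataFrame, the "orders" view, and the column names are made up for illustration):

    // Assuming spark is an existing SparkSession and ordersDf an existing DataFrame.
    ordersDf.cache()                            // cache the underlying DataFrame for repeated access
    ordersDf.createOrReplaceTempView("orders")  // the view is just a name pointing at ordersDf, no extra copy

    // Run SQL programmatically; the result comes back as a DataFrame
    val bigCustomers = spark.sql(
      """SELECT customer_id, SUM(amount) AS total
         FROM orders
         GROUP BY customer_id
         HAVING SUM(amount) > 1000""")
    bigCustomers.show()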

So, in summary: within a Spark app, the DataFrame is the temporary data store / intermediate table (one could argue about the latter term). If you need complex SQL against a DataFrame that the DataFrame API cannot handle, you register a temp view and use the SQL interface. That is different from Spark SQL against a real Hive or JDBC-read table, but the interface is the same.

This btw is a good reference: https://medium.com/@kar9475/data-sharing-between-multiple-spark-jobs-in-databricks-308687c99897
