Spark syntax for caching

Time:08-03

I usually use the following syntax for caching dataframes:

df = dataframe.select(col1, col2, col3)
df.cache()

function1(df)
function2(df)

Now I have observed a case in which the first syntax does not work, but this one does:

df = dataframe.select(col1, col2, col3).cache()

function1(df)
function2(df)

Can the choice of syntax make a difference in some cases? I would expect the two to do exactly the same thing. I am using Spark 2.4.4.

Edit: My guess is the following:

Case 1: If function1 only uses part of the data, only that part is cached. If function2 needs different data than function1, the cached dataframe would not be sufficient, so the dataframe has to be re-created.

Case 2: Creating a new dataframe with cache() puts the full dataframe in memory, so there is no need to re-create it later for different data.

Is that a possibility?

CodePudding user response:

The two code blocks in the question are not equivalent.

Calling .cache() on a dataframe does not cache that dataframe in place; it returns a cached dataframe.

The following two code blocks are equivalent:

df = dataframe.select(col1, col2, col3)
df = df.cache() # reassign df so the cached version is used from here on

function1(df)
function2(df)

and

df = dataframe.select(col1, col2, col3).cache()

function1(df)
function2(df)
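To make the equivalence concrete, here is a minimal sketch. It is a toy stand-in, not the real Spark API, written under this answer's reading that cache() returns a new cached object rather than mutating its receiver; the class and attribute names are invented for illustration:

```python
# Toy model (NOT the real Spark API): an immutable "frame" whose
# cache() returns a new cached instance instead of mutating self.

class ToyFrame:
    def __init__(self, cols, cached=False):
        self.cols = cols
        self.cached = cached

    def select(self, *cols):
        # Returns a new frame; the original is untouched.
        return ToyFrame(list(cols), cached=False)

    def cache(self):
        # Returns a *new* cached frame; self.cached stays False.
        return ToyFrame(self.cols, cached=True)

dataframe = ToyFrame(["col1", "col2", "col3"])

# Form 1: reassign after cache()
df = dataframe.select("col1", "col2")
df = df.cache()

# Form 2: chain cache() onto select()
df2 = dataframe.select("col1", "col2").cache()

assert df.cached and df2.cached              # both forms end up cached
assert not dataframe.select("col1").cached   # forgetting to reassign leaves an uncached frame
```

Both forms end with the name df (or df2) bound to the cached object, which is the whole point of the reassignment.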

CodePudding user response:

DataFrames are immutable in Spark.

When you call cache() on df, you don't cache df itself; you get a new DataFrame that is cached.

You have to save the new cached DataFrame and then use that in your functions.

It is important to note that your first example does not benefit from caching: every function call on df re-runs all the transformations you applied before calling cache().

The first example needs to be rewritten like this:

df = dataframe.select(col1, col2, col3)
newDf = df.cache()

function1(newDf)
function2(newDf)
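Under the same reading, you can count how often the source is recomputed. The sketch below is again a toy model with invented names, not the real Spark API: discarding the result of cache() leaves every action recomputing from scratch, while using the returned cached frame computes only once:

```python
# Toy model (NOT the real Spark API): count how many times the
# expensive source computation runs for each usage pattern.

compute_count = 0

class LazyFrame:
    def __init__(self, compute, cached_rows=None):
        self._compute = compute          # thunk producing the rows
        self._cached_rows = cached_rows

    def cache(self):
        # Returns a NEW frame holding materialized rows; self is unchanged.
        return LazyFrame(self._compute, cached_rows=self._compute())

    def collect(self):
        if self._cached_rows is not None:
            return self._cached_rows     # served from the cache
        return self._compute()           # recomputed every time

def expensive_source():
    global compute_count
    compute_count += 1
    return [1, 2, 3]

# Pattern 1: cache() result discarded -> every collect() recomputes.
df = LazyFrame(expensive_source)
df.cache()                   # the returned cached frame is thrown away
df.collect()
df.collect()
print(compute_count)         # 3: one inside cache(), two recomputations

# Pattern 2: keep the cached frame -> computed once.
compute_count = 0
newDf = LazyFrame(expensive_source).cache()
newDf.collect()
newDf.collect()
print(compute_count)         # 1: only the materialization inside cache()
```

The second pattern is exactly the rewritten example above: the result of cache() is kept in newDf and every subsequent use goes through it.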