Is a PySpark DataFrame passed by reference or by value when passed as an argument?


Let's say I have this code:

from pyspark import StorageLevel
from pyspark.sql import DataFrame

def func1():
    # some code to create a dataframe df
    df.persist(StorageLevel.MEMORY_AND_DISK)
    return df.repartition("col1", "col2")

def func2(df: DataFrame):
    df = (df.select("col1", "col2")
            .groupBy("col1")
            .count()
            .withColumnRenamed("count", "count_col1"))
    return df

So when I pass the variable df into func2, is it passed by reference or by value? Will the repartition() applied in func1 help the performance of the groupBy in func2? And if I call persist() in func1, will the DataFrame be kept in memory, so that every reference to df in func2 reads from the location where it was saved once in func1? Is that correct?

Thanks!

CodePudding user response:

In Python, the question "value or reference?" is best answered by noting that every argument is passed the same way: as a reference to an object (sometimes called "call by sharing"). What differs is whether the object itself can be modified once it is inside the function:

  • Mutable objects (like list, dictionary, set) can be changed in place after initialization, so changes made inside a function are visible to the caller. They behave like call by reference.
  • Immutable objects (like int, string, tuple) cannot be changed after initialization; to "change" them you must rebind the name to a new object, and that rebinding stays local to the function. They behave like call by value.

PySpark DataFrames are immutable: every transformation returns a new DataFrame. Rebinding df inside func2 therefore creates a new local object and leaves the caller's df untouched, as the sketch below illustrates.
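To see the distinction in plain Python, here is a minimal sketch (not Spark-specific):

def rebind(x):
    # rebinds the local name only; the caller's object is untouched
    x = x + [4]
    return x

def mutate(x):
    # mutates the shared object in place; the caller sees the change
    x.append(4)

a = [1, 2, 3]
rebind(a)
print(a)  # [1, 2, 3]    - rebinding inside the function did not change a
mutate(a)
print(a)  # [1, 2, 3, 4] - in-place mutation is visible to the caller

Since a DataFrame offers no in-place mutation at all, the df = df.select(...) line in func2 is pure rebinding: the caller's df is unchanged.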

CodePudding user response:

There are two aspects here:

  1. Is the Python variable df passed by value or by reference?
  2. How is the data that is referenced by df passed around?

For the first question there are answers available elsewhere, for example this one. But that is a question about Python, and it is not really relevant if we want to know how Spark handles the data inside the DataFrames.

To answer the second question we should consider what df really is. df does not contain the actual data, and it is not even a direct reference to it. Instead, all Spark transformations are recorded in the execution plan of this object, and only when an action like save, count or collect is finally called does Spark execute (after some optimizations) this plan. That is the very first time any data is actually moved by Spark's executors.

To check the execution plan you can call DataFrame.explain. No matter how complex your Spark logic is, this call returns immediately and prints out the execution plan. The response is fast precisely because no actual data operations are carried out until you run a Spark action.
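For illustration, a minimal sketch, assuming an active SparkSession named spark (the example data and grouping are made up for the demo):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)                                 # returns immediately, no data moved
counts = df.groupBy((df.id % 10).alias("bucket")).count()   # still just a plan

counts.explain()  # prints the execution plan instantly; nothing is executed
counts.count()    # an action: only now does Spark actually run the plan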

TL;DR: passing a Python variable around will never move any data of your DataFrame. And the answer to your last question is yes, with one caveat: persist() is itself lazy, so the cache is only filled the first time an action materializes df; after that, func2 and any later use read from the cached copy instead of recomputing it.
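A final sketch of that persist/reuse flow, assuming func1 and func2 from the question:

df = func1()        # persist() only marks df for caching; nothing is cached yet
df.count()          # first action: computes df once and fills the cache

result = func2(df)  # builds a new plan on top of the cached df
result.show()       # reads df from memory/disk instead of recomputing it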
