Caching a PySpark Dataframe


Suppose we have a PySpark DataFrame df with ~10M rows and columns [col_a, col_b]. Which of the following would be faster:

df_test = df.sample(0.1)
for i in range(10):
  df_sample = df_test.select(df.col_a).distinct().take(10)

or

df_test = df.sample(0.1)
df_test = df_test.cache()
for i in range(10):
  df_sample = df_test.select(df.col_a).distinct().take(10) 

Would caching df_test make sense here?

CodePudding user response:

It won't make much difference here. It is just one loop, so you can skip cache() and do the sampling inside the loop, like below:

>>> for i in range(10):
...   df_sample = df.sample(0.1).select(df.col_a).distinct().take(10)

Here Spark loads the data into memory once.

If you want to reuse the sampled DataFrame (df_test) repeatedly in other operations, then cache() does make sense.
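
For example, a minimal sketch of that reuse case (assuming a SparkSession named spark and the same [col_a, col_b] schema as in the question; the toy data is made up for illustration):

from pyspark.sql import SparkSession

# Hypothetical session and toy DataFrame mirroring the question's [col_a, col_b] schema.
spark = SparkSession.builder.appName("cache-demo").getOrCreate()
df = spark.createDataFrame([(i, i * 2) for i in range(100000)], ["col_a", "col_b"])

# Sample once and cache, because the result is reused by several actions below.
df_test = df.sample(0.1).cache()
df_test.count()  # the first action materializes the cache

distinct_a = df_test.select("col_a").distinct().take(10)  # served from cached partitions
row_count = df_test.count()                               # no re-scan of df
max_b = df_test.agg({"col_b": "max"}).collect()

df_test.unpersist()  # release the cached partitions when done

Without cache(), each of those actions would trigger a fresh scan of df to recompute the sample.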
