Suppose we have a PySpark dataframe df with ~10M rows and columns [col_a, col_b]. Which would be faster:

df_test = df.sample(0.1)
for i in range(10):
    df_sample = df_test.select(df.col_a).distinct().take(10)

or

df_test = df.sample(0.1)
df_test = df_test.cache()
for i in range(10):
    df_sample = df_test.select(df.col_a).distinct().take(10)

Would caching df_test make sense here?
CodePudding user response:
It won't make much difference; it is just one loop, so you can skip cache() and sample inside it, like below:
>>> for i in range(10):
...     df_sample = df.sample(0.1).select(df.col_a).distinct().take(10)
Here Spark loads the data into memory once.
If you want to reuse df_sample repeatedly in other operations, then you can use cache().
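A minimal sketch of that reuse pattern, assuming an existing SparkSession named spark and the col_a/col_b schema from the question (the small createDataFrame input is just a hypothetical stand-in for the real 10M-row dataframe):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the real dataframe from the question.
df = spark.createDataFrame([(i % 100, i) for i in range(1000)], ["col_a", "col_b"])

# Sample once and cache, because the sample is reused by several actions below.
df_test = df.sample(0.1).cache()

# The first action materializes the cache; later actions reuse it instead of
# recomputing the sample from the source dataframe.
distinct_values = df_test.select("col_a").distinct().take(10)
row_count = df_test.count()

# Free the cached partitions once they are no longer needed.
df_test.unpersist()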