I'm having a hard time finding examples on how to cache RDDs in PySpark. Right now I'm doing something like this:
rdd2 = rdd1.map(...)
rdd2.cache()
foo = rdd2.collect()
bar = rdd2.count()
rdd2.unpersist()
...
I wonder if I am caching rdd2 correctly, i.e. will rdd2.count() trigger another full computation from the root RDD, or will it use the cached rdd2 produced by rdd1.map()? Do I have to call cache() after an action on rdd2 (such as after collect()), or does it not matter? Would appreciate any help, thanks!
CodePudding user response:
Because Spark is lazy, cache() by itself does not compute anything; the RDD is only materialized and stored when the first action runs (e.g. count() or collect()). So in your case rdd2.cache() takes effect when the first action, collect(), is executed, and the subsequent count() will use the already cached RDD instead of recomputing from the root RDD.
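To make that concrete, here is a minimal sketch of the pattern from the question, assuming a local SparkSession; the lambda in map() is just a placeholder for the real transformation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd1 = sc.parallelize(range(10))
rdd2 = rdd1.map(lambda x: x * 2)  # transformation only: nothing runs yet

rdd2.cache()          # only marks rdd2 for caching; nothing is computed here
foo = rdd2.collect()  # first action: rdd2 is computed AND stored in memory
bar = rdd2.count()    # second action: served from the cached partitions
rdd2.unpersist()      # release the cached data when you're done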
CodePudding user response:
RDD.cache() applies to the whole RDD, so it only needs to be called once, even if you run multiple actions afterwards.
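As a small sketch (reusing rdd2 from the question), one cache() call covers any number of later actions; is_cached and getStorageLevel() are only there to inspect the caching state:

rdd2.cache()
print(rdd2.is_cached)          # True once the RDD is marked for caching
print(rdd2.getStorageLevel())  # shows the storage level cache() selected

total = rdd2.count()    # first action materializes the cache
sample = rdd2.take(5)   # reuses the cached data
items = rdd2.collect()  # reuses it again
rdd2.unpersist()        # call once, after all actions are finished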