Correct way to use cache() for PySpark RDD


I'm having a hard time finding examples of how to cache RDDs in PySpark. Right now I'm doing something like this:

rdd2 = rdd1.map(...)
rdd2.cache()
foo = rdd2.collect()
bar = rdd2.count()
rdd2.unpersist()
...

I wonder if I am caching rdd2 correctly, i.e. will rdd2.count() trigger another full computation from the root RDD, or will rdd2.count() use the cached rdd2 produced by rdd1.map()? Do I have to call cache() after an action on rdd2 (like after collect()), or does it not matter? Would appreciate any help, thanks!

CodePudding user response:

Because Spark is lazy, cache() only takes effect when an action is applied (e.g. count(), collect()). So in your case rdd2.cache() "triggers" when the first action (collect()) runs, and count() will then use the already-cached RDD.
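
To make the laziness concrete, here is a minimal sketch (assuming a local SparkContext named sc; the RDD names mirror the question) showing that cache() only marks the RDD and that the data is materialized by the first action:

from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-demo")

rdd1 = sc.parallelize(range(1_000_000))
rdd2 = rdd1.map(lambda x: x * 2)

rdd2.cache()                   # only marks rdd2 for caching; nothing is computed yet
print(rdd2.is_cached)          # True -- the flag is set immediately...
print(rdd2.getStorageLevel())  # ...with the default storage level for RDD cache()

foo = rdd2.collect()           # first action: computes rdd2 from rdd1 AND stores it
bar = rdd2.count()             # second action: served from the cached partitions

rdd2.unpersist()               # release the cached partitions when done
sc.stop()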

CodePudding user response:

RDD.cache() applies to the whole RDD and only needs to be called once, even if multiple actions follow.
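
As a small illustrative sketch (reusing the hypothetical sc and rdd2 from the previous example), one cache() call serves any number of later actions, including actions on RDDs derived from the cached one:

rdd2.cache()                   # one call covers the whole RDD

total = rdd2.count()           # materializes and caches rdd2
sample = rdd2.take(5)          # reads from the cache
evens = rdd2.filter(lambda x: x % 2 == 0).count()  # derived RDD also reuses the cached parent

rdd2.unpersist()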
