When and how to remove DataFrame from cache in spark?


I'm just learning Spark, and I'm wondering: in a Spark script, should I clean up a DataFrame after I've executed the code that uses it?

e.g.:

# Do something on friends DF...
friendsByAge = lines.select("age", "friends")
friendsByAge.groupBy("age").avg("friends").show()

# now do something unrelated to friends DF

In the case above, is the friendsByAge DataFrame kept in memory for the entire driver script's execution (even after I no longer need it)? If so, should I clean it up somehow, or is it removed from memory once I've called show()?

CodePudding user response:

DataFrames are evaluated lazily, so friendsByAge is only computed when you run the show action. It is also not cached automatically (only if you explicitly call cache or persist on it), so in your example there is nothing to clean up. If you do cache a DataFrame called df, you can remove it from the cache with:

df.unpersist()
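
For completeness, here is a minimal sketch of the full cache / use / unpersist lifecycle. The file name fakefriends.csv, its columns, and the way lines is built are hypothetical stand-ins for whatever your script actually does:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheExample").getOrCreate()

# Hypothetical input; substitute whatever builds your `lines` DataFrame.
lines = spark.read.option("header", "true").option("inferSchema", "true").csv("fakefriends.csv")

friendsByAge = lines.select("age", "friends")

# Caching only pays off because several actions below reuse the same DataFrame;
# with a single show() there would be nothing to gain from caching.
friendsByAge.cache()

friendsByAge.groupBy("age").avg("friends").show()  # first action: computes and caches
friendsByAge.groupBy("age").count().show()         # second action: reads from the cache

# Done with it: release the cached storage explicitly.
friendsByAge.unpersist()

spark.stop()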