Is using cache() the same as using persist() with no parameters in PySpark?

Is there any significant difference between persist() with no parameters and cache()?

I know that cache() uses the default storage level, while persist() lets you pass a storage level of your choice.

Answer:

There is no difference. cache() is simply an alias for persist() with no arguments, as the Spark source for Dataset shows:

Source code

  /**
   * Persist this Dataset with the default storage level (`MEMORY_AND_DISK`).
   *
   * @group basic
   * @since 1.6.0
   */
  def cache(): this.type = persist()

And persist() without parameters, which is what cache() calls, is:

  /**
   * Persist this Dataset with the default storage level (`MEMORY_AND_DISK`).
   *
   * @group basic
   * @since 1.6.0
   */
  def persist(): this.type = {
    sparkSession.sharedState.cacheManager.cacheQuery(this)
    this
  }
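
To make the delegation concrete, here is a minimal toy model in plain Python. It is not Spark code: `ToyDataset`, `DEFAULT_LEVEL` and the `storage_level` attribute are hypothetical names made up for illustration, and the single `persist` with a default argument stands in for the two Scala overloads `persist()` and `persist(newLevel)`. It only shows that a no-argument `cache()` that calls `persist()` has to end up with the same storage level.

```python
# Toy model, not Spark code: mirrors the delegation shown in the Scala source above.
DEFAULT_LEVEL = "MEMORY_AND_DISK"  # the default named in the Scaladoc quoted above


class ToyDataset:
    """Hypothetical stand-in for a Dataset; it only tracks its storage level."""

    def __init__(self):
        self.storage_level = None  # None means "not persisted"

    def persist(self, level=DEFAULT_LEVEL):
        # With no argument, the default storage level is used.
        self.storage_level = level
        return self  # return self so calls can be chained, like this.type in Scala

    def cache(self):
        # cache() adds nothing of its own: it calls persist() with no arguments.
        return self.persist()


cached = ToyDataset().cache()
persisted = ToyDataset().persist()
custom = ToyDataset().persist("DISK_ONLY")

print(cached.storage_level == persisted.storage_level)  # True
print(custom.storage_level)  # DISK_ONLY
```

In real PySpark, you can make the same check by reading the `storageLevel` property of a DataFrame after calling `cache()` and after calling `persist()`. A different level is only possible through `persist()` with an explicit argument.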