What happens when I cache a DataFrame that never has an action called on it directly, but is used by other DataFrames that do run actions?
import org.apache.spark.sql.{DataFrame, SparkSession}

val sparkS: SparkSession = SparkSession.builder().getOrCreate()

// A is filtered and marked for caching; no action is called on it directly.
val dataFrameA: DataFrame = sparkS.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(pathA)
  .filter(condition)
  .cache()

val dataFrameB: DataFrame = sparkS.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(pathB)

val dataFrameC: DataFrame = sparkS.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(pathC)

// Actions run only on the joined results, both of which depend on A.
val resultB = dataFrameB.crossJoin(dataFrameA)
resultB.count()
resultB.show()

val resultC = dataFrameC.crossJoin(dataFrameA)
resultC.count()
resultC.show()
Will dataFrameA be cached?
CodePudding user response:
Yes, it can be helpful to cache dataFrameA since its output is used in multiple places.
But to take a step back: even if you never call an action on dataFrameA itself, it can still be computed as part of one. When you write Spark code, you give Spark a chain of "transformations" that eventually ends in an "action". Spark then translates those transformations and actions into an execution plan. It does not matter which object you call the action on; as long as the data is used in the computation, it will be part of the execution plan. Note that cache() is lazy: it only marks dataFrameA, and the data is actually stored when the first action that depends on it runs (resultB.count() in your code). Later actions can then reuse the stored data.
If you want to see how Spark is creating the execution plan, you can use the explain() method on your result DataFrame.
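As a rough sketch of what that looks like, the snippet below builds a small version of your join and prints its plan. The local master, the app name, and the tiny in-memory tables are my own stand-ins for your CSV inputs, not something from your code. It is written script-style, so it should work pasted into spark-shell.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a local session and small in-memory tables replace the CSV reads.
val sparkS: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("cache-explain-sketch") // hypothetical name
  .getOrCreate()
import sparkS.implicits._

// Same shape as in the question: A is filtered and marked for caching.
val dataFrameA: DataFrame = Seq(1, 2, 3).toDF("a").filter($"a" > 1).cache()
val dataFrameB: DataFrame = Seq("x", "y").toDF("b")

val resultB: DataFrame = dataFrameB.crossJoin(dataFrameA)

// Prints the physical plan to stdout. explain() is not an action,
// so nothing is computed or cached at this point.
resultB.explain()

// The same information as a string, via the developer-facing queryExecution API:
// the logical plan after cached data has been substituted in.
val planText: String = resultB.queryExecution.withCachedData.toString
println(planText)
```

In the printed plan, the branch for dataFrameA should show up as an in-memory relation/scan rather than a plain scan of the source, which is how you can tell the join will read A from the cache once it has been populated.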
CodePudding user response:
Yes.
If you call cache(), it does cache the DataFrame.
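If you want to check this yourself, one option is to look at storageLevel before and after the call. This is only a sketch under my own assumptions: a fresh local session and made-up in-memory data instead of your CSV files.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

// Assumption: a fresh local session; the data below is made up for illustration.
val sparkS: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("cache-storage-level-sketch") // hypothetical name
  .getOrCreate()
import sparkS.implicits._

val dataFrameA: DataFrame = Seq(10, 20, 30).toDF("a").filter($"a" > 10)

// Not marked for caching yet, so the reported level is NONE.
val before: StorageLevel = dataFrameA.storageLevel

// cache() registers the DataFrame for caching but does not compute it.
val cachedA: DataFrame = dataFrameA.cache()
val after: StorageLevel = cachedA.storageLevel

val dataFrameB: DataFrame = Seq("x", "y").toDF("b")
val resultB: DataFrame = dataFrameB.crossJoin(cachedA)

// First action that depends on A: this is when the cache gets populated.
val rows: Long = resultB.count()

println(s"before=$before, after=$after, rows=$rows")

// Release the cached data once it is no longer needed.
cachedA.unpersist()
```

The action is called on resultB, not on A, yet A is what gets stored, because A is part of resultB's plan.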