I'm working with Apache Spark and I have the following code:
Dataset<Row> tradesDataset = sparkSession
        .sql("select * from a_table")
        .cache(); // <-- do I need caching here?

long countOfDistinctUitIdsInTradeAgreements = tradesDataset
        .select(tradesDataset.col("uitid"))
        .distinct()
        .count();

long countOfDistinctUitIdsInTradeAgreementsForTradeDate = tradesDataset
        .filter(tradesDataset.col("TRADE_DATE").equalTo(processingDate))
        .count();
I run two counts on the same dataset: a distinct count over one column, and a count of the rows that match a trade date.
So my question is: do I need to cache the dataset returned by the select? Will it bring any performance improvement?
CodePudding user response:
Looking at your example, it may actually run quicker without the cache.
You are caching right after select * from, so Spark is going to read the whole dataset with all of its columns and store it in the cache.
Later you use it only to get counts, and for those you don't need the whole dataset, only one column per query (uitid for the first, TRADE_DATE for the second).
Without the cache you are going to read the source twice, that's correct, but most likely Spark is going to push down the projections and figure out that only one column has to be read in each case, which means that the real I/O may be smaller.
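If you want to see whether the projection is really pushed down, you can look at the physical plan. Below is a minimal sketch, not your actual setup: the class name PruningCheck, the method uncachedPlan, the notes column, the local master and the small Parquet file standing in for a_table are all my assumptions, made up so the example can run on its own.

```java
import java.nio.file.Files;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PruningCheck {

    /** Returns the physical plan of the distinct-uitid query, built without cache(). */
    static String uncachedPlan() throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("pruning-check")
                .master("local[*]") // assumption: local run, for illustration only
                .getOrCreate();
        try {
            // Hypothetical stand-in for a_table: a small Parquet file with three columns.
            String path = Files.createTempDirectory("pruning-check")
                    .resolve("a_table").toString();
            spark.range(0, 1000)
                    .selectExpr(
                            "id % 10 as uitid",
                            "cast(id as string) as notes",
                            "date_add(date '2024-01-01', cast(id % 3 as int)) as TRADE_DATE")
                    .write().parquet(path);
            spark.read().parquet(path).createOrReplaceTempView("a_table");

            // Same shape as the query in the question, minus the cache() call.
            Dataset<Row> trades = spark.sql("select * from a_table");
            return trades.select(trades.col("uitid")).distinct()
                    .queryExecution().executedPlan().toString();
        } finally {
            spark.stop();
        }
    }

    public static void main(String[] args) throws Exception {
        // In the scan node, check which columns are listed; with a columnar
        // source such as Parquet only uitid should be read.
        System.out.println(uncachedPlan());
    }
}
```

On your real table you can get the same information with explain() on the dataset. If the scan still lists every column, the source probably does not support column pruning, and then the cache is more likely to pay off.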
Of course it may depend on the source you are reading from, so I think it's worth trying both options and comparing the results.
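A rough way to compare the two options is to time both counts with and without cache(). This is only a sketch: the class CacheTiming, the helpers timeCounts and compare, the in-memory stand-in for a_table and the date value are assumptions of mine, and the numbers only mean something once you point it at your real table.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CacheTiming {

    /** Runs both counts from the question and returns the elapsed milliseconds. */
    static long timeCounts(Dataset<Row> trades, String processingDate) {
        long start = System.nanoTime();
        trades.select(trades.col("uitid")).distinct().count();
        trades.filter(trades.col("TRADE_DATE").equalTo(processingDate)).count();
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Returns {uncachedMs, cachedMs} for the two variants. */
    static long[] compare() {
        SparkSession spark = SparkSession.builder()
                .appName("cache-timing")
                .master("local[*]") // assumption: local run, for illustration only
                .getOrCreate();
        try {
            // Hypothetical stand-in for a_table; replace with your real source.
            spark.range(0, 1_000_000)
                    .selectExpr(
                            "id % 1000 as uitid",
                            "cast(id as string) as notes",
                            "date_add(date '2024-01-01', cast(id % 3 as int)) as TRADE_DATE")
                    .createOrReplaceTempView("a_table");
            String processingDate = "2024-01-02"; // hypothetical value

            // Variant 1: no cache, each action scans the source on its own.
            long uncachedMs = timeCounts(spark.sql("select * from a_table"), processingDate);

            // Variant 2: cache, the first action also pays for filling the cache.
            Dataset<Row> cached = spark.sql("select * from a_table").cache();
            long cachedMs = timeCounts(cached, processingDate);
            cached.unpersist();

            return new long[] {uncachedMs, cachedMs};
        } finally {
            spark.stop();
        }
    }

    public static void main(String[] args) {
        long[] result = compare();
        System.out.println("without cache: " + result[0] + " ms");
        System.out.println("with cache:    " + result[1] + " ms");
    }
}
```

Run it a few times and ignore the first run, because JVM warm-up and session start-up distort a single measurement.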