Suggestion for multiple joins in Spark


Recently I got a requirement to perform a combination of joins.

I have to perform around 30 to 36 joins in Spark.

Spark was taking a long time to build the execution plan, so I cut the lineage at intermediate stages using df.localCheckpoint().

Is this a good approach? If you have any thoughts, please share.
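
Roughly, the pipeline looks like this (a simplified sketch in Scala; the DataFrame list, the join key, and the checkpoint interval are all hypothetical):

    import org.apache.spark.sql.DataFrame

    // Fold ~30 joins over a common key, cutting the lineage every 10
    // joins so the optimizer does not re-analyze the whole accumulated
    // plan at every step.
    def joinAll(dfs: Seq[DataFrame], key: String): DataFrame =
      dfs.tail.zipWithIndex.foldLeft(dfs.head) { case (acc, (df, i)) =>
        val next = acc.join(df, Seq(key))
        if ((i + 1) % 10 == 0) next.localCheckpoint() else next
      }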

CodePudding user response:

Yes, it is fine.

This is mostly discussed for iterative ML algorithms, but it applies equally to a Spark application with many steps, e.g. a long chain of joins.

Quoting from https://medium.com/@adrianchang/apache-spark-checkpointing-ebd2ec065371:

Spark programs take a huge performance hit when a fault occurs, as the entire set of transformations that produced a DataFrame or RDD has to be recomputed; the cost also grows with each additional transformation applied on top of an RDD or DataFrame.

Note that localCheckpoint() is not "reliable": it stores the checkpointed data on executor local storage, so if an executor is lost, the data is gone and, because the lineage has been truncated, Spark cannot recompute it. The reliable alternative is checkpoint() with a checkpoint directory on a fault-tolerant file system, at the cost of extra I/O.
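
A minimal sketch of the two options (assuming spark is a SparkSession and df an existing DataFrame; the checkpoint path is just an example):

    // localCheckpoint() materializes the data on executor local storage:
    // fast, but if an executor is lost the data is gone, and since the
    // lineage was truncated Spark cannot recompute it.
    val fastButFragile = df.localCheckpoint()

    // checkpoint() writes to the configured checkpoint directory on a
    // fault-tolerant file system and survives executor loss, at the
    // cost of extra disk and network I/O.
    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") // example path
    val durable = df.checkpoint()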

CodePudding user response:

Caching is definitely a valid strategy for optimizing performance. In general, assuming the data size and the resources of your Spark application stay unchanged, there are three points to consider when you want to optimize a join:

  1. Data skew: most of the time, when I try to find out why a join is slow, data skew is one of the reasons. In fact, not only joins: any shuffled transformation needs an even data distribution, otherwise one oversized partition makes the whole stage wait on a single straggling task. Make sure your data is well distributed (see the skew check sketched after this list).

  2. Data broadcasting: shuffling is usually unavoidable in a join. But in some cases we join a very big DataFrame against a relatively small reference DataFrame, and shuffling the big DataFrame is very expensive. Instead, broadcast the small DataFrame to every node and avoid the costly shuffle altogether (sketched below).

  3. Keep your joined data as lean as possible: as mentioned in point 2, shuffling is unavoidable in a join, so keep your DataFrames as lean as possible. Drop unnecessary rows and columns before the join to reduce the amount of data moved across the network during the shuffle (sketched below).
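
For point 1, a quick way to spot skew is to count rows per join key and look for keys that dwarf the rest (the DataFrame and column names here are hypothetical). On Spark 3.x, adaptive query execution can also split skewed join partitions automatically:

    import org.apache.spark.sql.functions.desc

    // Count rows per join key; a handful of keys with huge counts
    // means the join will hot-spot on a few partitions.
    ordersDf.groupBy("customer_id")
      .count()
      .orderBy(desc("count"))
      .show(10)

    // Spark 3.x: let AQE detect and split skewed join partitions.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")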
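
For point 2, Spark exposes this as a broadcast hint (names are hypothetical: countryDf is the small reference table, eventsDf the large one):

    import org.apache.spark.sql.functions.broadcast

    // Ship the small side to every executor; the large side is then
    // joined in place, with no shuffle of eventsDf.
    val joined = eventsDf.join(broadcast(countryDf), Seq("country_code"))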
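
And for point 3, a sketch of pruning before the join (same hypothetical names; the filter condition and column list are examples):

    import org.apache.spark.sql.functions.col

    // Filter rows and select only the needed columns before the join,
    // so the shuffle moves as little data as possible.
    val leanEvents = eventsDf
      .where(col("event_date") >= "2021-01-01")
      .select("country_code", "user_id", "amount")

    val result = leanEvents.join(countryDf, Seq("country_code"))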
