I am currently using Databricks to process data coming from our Azure Data Lake. The majority of the data is read into pySpark dataframes and the datasets are relatively big. However, I also have to perform some joins against smaller static tables to fetch additional attributes.
Currently, the only way I can do this is by converting those smaller static tables into pySpark dataframes as well. I'm just curious whether using such a small table as a pySpark dataframe is bad practice. I know pySpark is meant for large datasets that need to be distributed, but given that my large dataset is in a pySpark dataframe, I assumed I would have to convert the smaller static table into a pySpark dataframe as well in order to make the appropriate joins.
Any tips on best practices for joining with very small datasets would be appreciated. Maybe I am overcomplicating something that isn't even a big deal, but I was curious. Thanks in advance!
CodePudding user response:
Take a look at Broadcast joins. Wonderfully explained here https://mungingdata.com/apache-spark/broadcast-joins/
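For context, here is a minimal PySpark sketch of a broadcast join. The dataframes large_df and small_df and the column name join_key are placeholders, not names from your setup:

from pyspark.sql.functions import broadcast

# Hint Spark to ship the small static table to every executor,
# so the join avoids shuffling the large dataframe.
joined_df = large_df.join(broadcast(small_df), on="join_key", how="left")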
CodePudding user response:
The best practice in your case is to broadcast your small df and join it to your large df, like the code below. Note that for dataframe joins the hint is the broadcast function from org.apache.spark.sql.functions, not sc.broadcast (which creates a broadcast variable and cannot be passed to join):
import org.apache.spark.sql.functions.broadcast

// "join_key" is a placeholder for whatever column the two dataframes share
val joinedDF = largeDF.join(broadcast(smallDF), Seq("join_key"))
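Since you are on pySpark, the same hint is available as pyspark.sql.functions.broadcast (see the sketch under the first answer). Spark also broadcasts small tables automatically when their estimated size falls below spark.sql.autoBroadcastJoinThreshold (10 MB by default), so the explicit hint mainly matters when that estimate is off. A rough sketch of checking and raising the threshold, assuming an existing spark session:

# Inspect and (optionally) raise the auto-broadcast threshold; the value is in bytes.
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)  # 50 MB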