I have a dataframe with a single integer column and 1B rows, so the size of the dataframe should ideally be 1B * 4 bytes ~= 4GB. This matches what I see when I cache the dataframe and check its size: it is around 4GB.
Now, if I try to broadcast the same dataframe to join it with another dataframe, I get an error:
    Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 14 GB
Why does the size of a broadcast dataframe increase? I have seen this in other cases as well, where a 300MB dataframe shows up as a 3GB broadcast dataframe in the Spark UI SQL tab.
Any reasoning or help is appreciated.
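For reference, a rough sketch of the setup (spark is the active SparkSession and otherDf stands in for the second dataframe; both names are just placeholders):

    import org.apache.spark.sql.functions.broadcast

    // ~1B rows of a single integer column
    val bigDf = spark.range(0L, 1000000000L).selectExpr("cast(id as int) as value")
    bigDf.cache().count()   // materialize the cache; the Storage tab shows roughly 4GB

    // broadcasting the same dataframe into a join is what triggers the 8GB limit error
    val joined = otherDf.join(broadcast(bigDf), Seq("value"))
    joined.count()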
CodePudding user response:
The size increases in memory when the dataframe is broadcast across your cluster. How much it increases depends on how many workers you have, because Spark needs to copy your dataframe to every worker so it can serve your subsequent operations.
Do not broadcast big dataframes; broadcast only the small ones that you use in join operations, as sketched below.
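A minimal sketch of that pattern, with largeDf and smallDf standing in for your two dataframes (the join key "key" is assumed):

    import org.apache.spark.sql.functions.broadcast

    // hint only the small side; the large side stays partitioned across the executors
    val result = largeDf.join(broadcast(smallDf), Seq("key"))

    // Spark also auto-broadcasts any table below this threshold (default is 10MB)
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)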
As the Spark documentation on broadcast variables puts it:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
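To illustrate the quote, a small broadcast-variable example (the lookup map is made up):

    // a small, read-only lookup table, cached once per executor
    val lookup = Map(1 -> "a", 2 -> "b")
    val lookupBc = spark.sparkContext.broadcast(lookup)

    // tasks read lookupBc.value instead of shipping the map with every task
    val labelled = spark.sparkContext.parallelize(Seq(1, 2, 3))
      .map(id => (id, lookupBc.value.getOrElse(id, "unknown")))
    labelled.collect()   // Array((1,a), (2,b), (3,unknown))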
CodePudding user response:
This looks more like a bug in Spark's size estimation, according to this JIRA ticket: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-37321
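If the oversized estimate is what blocks the job, one possible workaround (reusing bigDf and otherDf from the question as placeholders) is to avoid broadcasting that table at all, either by dropping the explicit broadcast hint or by disabling automatic broadcast joins:

    // disable automatic broadcast joins so Spark falls back to a shuffle join
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // plain join without a broadcast hint
    val joined = bigDf.join(otherDf, Seq("value"))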