I was reading for some spark optimization techniques and found some configurations that we need to enable,such as
spark.conf.set("spark.sql.cbo.enabled", true)
spark.conf.set("spark.sql.adaptive.enabled",true)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled",true)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled",true)
Can I enable this for all my spark jobs, even if I don't need it? what are the downsides of including it? and why doesn't spark provide this performance by default? When should I use what?
CodePudding user response:
It does not turn on these features as they have a little more risk than not using them. To have the most stable platform they're not enabled by default. One thing that is called out and called out by Databricks is that CBO heavily rely on table statistics. So you need to regularly update these when your table statistics change significantly. I have hit edge cases where I had to remove CBO for my queries to complete. (I believe that this was related to a badly calculated map side join.)
The same is true of spark.sql.adaptive.skewJoin.enabled. This only helps if the table stats are up to date and you have skew. It could make your query take longer with out of data stats.
spark.sql.adaptive.coalescePartitions.enabled also looks great but should be used for specific types of performance tuning. There are knobs and levers here that could be used to drive better performance.
There settings in general are helpful might actually cover up a problem that you might want to be aware of. Yes, they are useful, yes you should use them. Perhaps you should leave them off until you need them. Often you get better performance out of tuning the algorithm of your spark job by understanding it and what it's doing. If you turn all this on by default you may not have as in-depth understanding or the implication of your choices.
(Java/Python do not force you to manage memory. This lack of understanding of the implications of what you use and its effect on performance is frequently learned the hard way with a performance issue that sneaks up on new developers.) This is a similar lesson but slight more sinister, as now they're switches to auto fix your bad queries, will you really learn to be an expert without understanding their value?
TLDR: Don't turn these on until you need them, or turn them on when you need to do something quick and dirty.
I hope this helps your understanding.