I am trying to read a large number of Avro files in subdirectories from S3 using spark.read.load on Databricks. I either get an error because the result size exceeds spark.driver.maxResultSize, or, if I increase that limit, the driver runs out of memory.
I am not performing any collect operation, so I'm not sure why so much memory is being used on the driver. I wondered if it was caused by an excessive number of partitions, so I tried various values of spark.sql.files.maxPartitionBytes, to no avail. I also tried increasing the driver memory and using a bigger cluster.
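Roughly what I'm doing, as a minimal PySpark sketch (the bucket, path, and partition-size value below are placeholders, not my real ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One of the knobs I tried tuning; the value here is illustrative only.
spark.conf.set("spark.sql.files.maxPartitionBytes", "128m")

# Load the top-level prefix; Spark lists the subdirectories itself.
df = spark.read.format("avro").load("s3://my-bucket/events/")  # hypothetical path
```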
The only thing that seemed to help slightly was specifying the Avro schema beforehand rather than inferring it: spark.read.load then finished without error, but memory usage on the driver was still extremely high, and the driver still crashed if I attempted any further operation on the resulting DataFrame.
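The schema-first variant looks roughly like this (the schema JSON and path are illustrative, not my real ones):

```python
# Supplying the Avro schema up front so Spark does not infer it from the files.
avro_schema = """
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "ts", "type": "string"}
  ]
}
"""

df = (
    spark.read.format("avro")
    .option("avroSchema", avro_schema)      # skip schema inference
    .load("s3://my-bucket/events/")         # hypothetical path
)
```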
CodePudding user response:
I discovered the problem was the spark.sql.sources.parallelPartitionDiscovery.parallelism option. It was set too low for the large number of files I was trying to read, which caused the driver to crash. Increasing this value fixed the problem and my code now works.
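For anyone hitting the same issue, a sketch of the change (the value shown is illustrative; depending on your Spark version you may need to set it in the cluster's Spark config rather than in the notebook session):

```python
# Raise the parallelism used for the distributed file-listing / partition
# discovery job before reading. 100000 is just an example value.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "100000")

df = spark.read.format("avro").load("s3://my-bucket/events/")  # hypothetical path
```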