I am trying to read a large number of Avro files in subdirectories from S3 using spark.read.load on Databricks. I either get an error because the result size exceeds spark.driver.maxResultSize, or, if I increase that limit, the driver runs out of memory.
I am not performing any collect operation, so I'm not sure why so much memory is being used on the driver. I wondered if it was caused by an excessive number of partitions, so I tried various values of spark.sql.files.maxPartitionBytes, to no avail. I also tried increasing the driver memory and using a bigger cluster.
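Roughly what I'm doing, as a minimal PySpark sketch (the bucket, path, and partition-size value below are placeholders, not my real ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One of the knobs I tried tuning; the value here is illustrative only.
spark.conf.set("spark.sql.files.maxPartitionBytes", "128m")

# Load the top-level prefix; Spark lists the subdirectories itself.
df = spark.read.format("avro").load("s3://my-bucket/events/")  # hypothetical path
```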
The only thing that seemed to help slightly was specifying the Avro schema beforehand rather than inferring it: spark.read.load then finished without error, but memory usage on the driver was still extremely high, and the driver still crashed if I attempted any further operation on the resulting DataFrame.
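The schema-first variant looks roughly like this (the schema JSON and path are illustrative, not my real ones):

```python
# Supplying the Avro schema up front so Spark does not infer it from the files.
avro_schema = """
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "ts", "type": "string"}
  ]
}
"""

df = (
    spark.read.format("avro")
    .option("avroSchema", avro_schema)      # skip schema inference
    .load("s3://my-bucket/events/")         # hypothetical path
)
```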
CodePudding user response:
I discovered the problem was the spark.sql.sources.parallelPartitionDiscovery.parallelism option. It was set too low for the large number of files I was trying to read, which caused the driver to crash. Increasing this value fixed the problem and my code now works.
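For anyone hitting the same issue, a sketch of the change (the value shown is illustrative; depending on your Spark version you may need to set it in the cluster's Spark config rather than in the notebook session):

```python
# Raise the parallelism used for the distributed file-listing / partition
# discovery job before reading. 100000 is just an example value.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "100000")

df = spark.read.format("avro").load("s3://my-bucket/events/")  # hypothetical path
```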