Is there a way to tell, before the write, how many files will be created when saving a Spark DataFrame as a table?


I am currently trying to save a Spark DataFrame to Azure Data Lake Storage (ADLS) Gen1. While doing so I receive the following throttling error:

org.apache.spark.SparkException: Job aborted. Caused by: com.microsoft.azure.datalake.store.ADLException: Error creating file /user/DEGI/CLCPM_DATA/fraud_project/policy_risk_motorcar_with_lookups/part-00000-34d88646-3755-488d-af00-ef2e201240c8-c000.snappy.parquet
Operation CREATE failed with HTTP401 : null
Last encountered exception thrown after 2 tries. [HTTP401(null),HTTP401(null)]

I read in the documentation that the throttling occurs due to CREATE limits, which then causes the job to abort. The documentation also gives three reasons why this may happen:

  1. Your application creates a large number of small files.
  2. External applications create a large number of files.
  3. The current limit for the subscription is too low.

While I do not think that my subscription limit is too low, it may be the case that my application is creating too many parquet files. Does anyone know how to tell how many files will be created when saving as a table? And how can I find out the maximum number of files that I am allowed to create?

The code that I use to create the table looks as follows:

df.write.format("delta").mode("overwrite").saveAsTable("database_name.df", path='adl://my path to storage')
 

Also, I was able to write a smaller test DataFrame without any problems, and the permissions of the folder in ADLS are set correctly.

CodePudding user response:

The error you have doesn't look like an issue with the number of files; HTTP 401 is an authorization error. Nonetheless:

Spark writes at least as many files as there are partitions, so what you want to do is repartition your DataFrame. There are several repartitioning APIs; to reduce the number of partitions without a full shuffle, it is recommended to use coalesce().

df.coalesce(10).write....
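
For example, here is a minimal sketch that reuses the write call from the question; the target of 10 partitions is only an illustration, and getNumPartitions() tells you up front how many part files the current DataFrame would produce:

# Spark writes at least one part file per partition,
# so the current partition count is the number of files to expect.
print(df.rdd.getNumPartitions())

# Collapse to at most 10 partitions (no full shuffle) before writing,
# so at most 10 parquet part files are created by this write.
df.coalesce(10) \
  .write \
  .format("delta") \
  .mode("overwrite") \
  .saveAsTable("database_name.df", path='adl://my path to storage')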
