File already exists error while writing Spark dataframe to S3 using AWS Glue


I'm using this command to write a dataframe to S3:

df.write \
    .option("delimiter", "|") \
    .option("header", True) \
    .option("compression", "gzip") \
    .mode("overwrite") \
    .format("csv") \
    .save("s3://bucketname/metrics/parsed/")

But I keep getting this error; only the filename changes between runs:

An error occurred while calling o293.save. File already exists:s3://bucketname/metrics/parsed/part-01195-6ef08750-dbf5-41c6-b024-501403820268-c000.csv.gz

Full error:

"Failure Reason": "JobFailed(org.apache.spark.SparkException: Job aborted due to stage failure: Task 1195 in stage 11.0 failed 4 times, most recent failure: 
Lost task 1195.3 in stage 11.0 (TID 3023) (172.36.67.235 executor 9):
 org.apache.hadoop.fs.FileAlreadyExistsException: File already exists

I tried the following, but every attempt ended with the same error (minimal sketches of each attempt follow the list):

  1. Added coalesce(100) to the command
  2. Writing to a new destination, with and without the .mode("overwrite") option
  3. Exporting the data in Parquet format
  4. Writing with the .mode("append") option
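
For reference, here are sketches of those attempts. The options and the coalesce factor come from the description above; the alternate destination paths are placeholders:

# 1. Reduce the number of output partitions before writing
df.coalesce(100).write \
    .option("delimiter", "|") \
    .option("header", True) \
    .option("compression", "gzip") \
    .mode("overwrite") \
    .format("csv") \
    .save("s3://bucketname/metrics/parsed/")

# 2 and 4: write to a fresh prefix (placeholder path), with append instead of overwrite
df.write \
    .option("delimiter", "|") \
    .option("header", True) \
    .option("compression", "gzip") \
    .mode("append") \
    .format("csv") \
    .save("s3://bucketname/metrics/parsed_new/")

# 3. Export as Parquet instead of CSV
df.write \
    .mode("overwrite") \
    .parquet("s3://bucketname/metrics/parsed_parquet/")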

I couldn't find anything that helps resolve this, except this post, but since I'm using Glue 3.0 (Spark 3.1), it shouldn't be applicable.

CodePudding user response:

It turns out the error displayed by Glue was not the actual exception. Although the later tasks failed with this error, an earlier stage had already failed because of an exception in the code.

After setting up the Spark UI for Glue, I was able to find the first failure and its cause.

Here's how to set up the Spark UI
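
The Spark UI is controlled by the Glue job parameters --enable-spark-ui and --spark-event-logs-path. One way to enable it on an existing job is via boto3; a minimal sketch, assuming a job named "my-glue-job" and an S3 prefix for event logs (both are placeholders):

import boto3

glue = boto3.client("glue")

# Fetch the current job definition so the update keeps its role and command
job = glue.get_job(JobName="my-glue-job")["Job"]
args = dict(job.get("DefaultArguments", {}))

# Turn on the Spark UI and point event logs at an S3 prefix (placeholder path)
args["--enable-spark-ui"] = "true"
args["--spark-event-logs-path"] = "s3://bucketname/sparkui/"

glue.update_job(
    JobName="my-glue-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": args,
    },
)

Once event logs are flowing to that prefix, a Spark history server (AWS provides a Docker image and a CloudFormation template for this) can read them and show which stage failed first.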
