File already exists error while writing Spark dataframe to S3 using AWS Glue

I'm using this command to write a dataframe to S3:

df.write.option("delimiter","|").option("header",True).option("compression", "gzip").mode("overwrite").format("csv").save("s3://bucketname/metrics/parsed/")

But I always get this error; only the filename changes:

An error occurred while calling o293.save. File already exists:s3://bucketname/metrics/parsed/part-01195-6ef08750-dbf5-41c6-b024-501403820268-c000.csv.gz

Full error:

"Failure Reason": "JobFailed(org.apache.spark.SparkException: Job aborted due to stage failure: Task 1195 in stage 11.0 failed 4 times, most recent failure: 
Lost task 1195.3 in stage 11.0 (TID 3023) (172.36.67.235 executor 9):
 org.apache.hadoop.fs.FileAlreadyExistsException: File already exists

I tried the following, but each attempt ends with the same error:

Added coalesce(100) in the command
Writing to a new destination, with and without the .mode("overwrite") option
Exporting the data in parquet format
Writing with .mode("append") option

I couldn't find anything that helps resolve this except this post, but I'm using Glue 3.0 (Spark 3.1), so it shouldn't apply.

User response:

It turns out the error displayed by Glue was not the real exception. The tasks did fail with this error, but before that there was an earlier failure in the stage caused by an exception in the code.
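
The log in the question hints at this: in "Lost task 1195.3", the number after the dot is the zero-based attempt, so the message Glue shows comes from the fourth try, not from the first failure. A small sketch for pulling those fields out of such a line (the helper name `parse_lost_task` is made up for illustration, and the pattern is only assumed to match lines shaped like the one in the question):

```python
import re

# Matches the "Lost task <index>.<attempt> in stage <stage>.<stage attempt>" fragment.
LOST_TASK = re.compile(r"Lost task (\d+)\.(\d+) in stage (\d+)\.(\d+)")


def parse_lost_task(message):
    """Return the task and stage numbers from a 'Lost task' line, or None."""
    match = LOST_TASK.search(message)
    if match is None:
        return None
    task_index, attempt, stage_id, stage_attempt = map(int, match.groups())
    return {
        "task_index": task_index,
        "attempt": attempt,  # zero-based, so 3 means the fourth try
        "stage_id": stage_id,
        "stage_attempt": stage_attempt,
    }


# Line copied from the error in the question.
line = "Lost task 1195.3 in stage 11.0 (TID 3023) (172.36.67.235 executor 9):"
info = parse_lost_task(line)
print(info)
```

If the attempt number is greater than 0, the reported exception is from a retry, and the failure worth reading is the one from attempt 0 of that task.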

After setting up Spark UI on Glue, I was able to find the first failure and the cause of it.

Here's how to set up Spark UI
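
For reference, Glue turns on the Spark UI event logs through job parameters. A minimal sketch, assuming the documented `--enable-spark-ui` and `--spark-event-logs-path` parameters; the helper name and the `spark-event-logs/` prefix are placeholders, not something from the original post:

```python
def spark_ui_job_arguments(event_logs_path):
    """Build the Glue job parameters that enable Spark event logging."""
    if not event_logs_path.startswith("s3://"):
        raise ValueError("event_logs_path must be an S3 URI")
    # Glue job parameters are passed as strings, including the boolean flag.
    return {
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": event_logs_path,
    }


# Hypothetical bucket and prefix; replace with a location the job role can write to.
args = spark_ui_job_arguments("s3://bucketname/spark-event-logs/")
print(args)
```

The same key/value pairs can be entered under the job's parameters in the Glue console, or, as far as I know, passed as `DefaultArguments` when defining the job with boto3. Once the event logs are in S3, a Spark history server pointed at that path shows every failed task attempt, including the first one.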