I'm using this command to write a dataframe to S3:
df.write.option("delimiter","|").option("header",True).option("compression", "gzip").mode("overwrite").format("csv").save("s3://bucketname/metrics/parsed/")
But I always get this error; only the filename changes between runs:
An error occurred while calling o293.save. File already exists:s3://bucketname/metrics/parsed/part-01195-6ef08750-dbf5-41c6-b024-501403820268-c000.csv.gz
Full error:
"Failure Reason": "JobFailed(org.apache.spark.SparkException: Job aborted due to stage failure: Task 1195 in stage 11.0 failed 4 times, most recent failure:
Lost task 1195.3 in stage 11.0 (TID 3023) (172.36.67.235 executor 9):
org.apache.hadoop.fs.FileAlreadyExistsException: File already exists
I tried the following, but each attempt ends with the same error:
- Adding coalesce(100) to the command
- Writing to a new destination, with and without the .mode("overwrite") option
- Exporting the data in parquet format
- Writing with the .mode("append") option
I couldn't find anything that helps resolve this except this post, but I'm using Glue 3.0 (Spark 3.1), so it shouldn't apply.
CodePudding user response:
It turns out the error displayed by Glue was not the root exception. The tasks did fail with this error, but an earlier stage had already failed because of an exception in my code.
After setting up Spark UI on Glue, I was able to find the first failure and the cause of it.
Here's how to set up Spark UI
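As a rough sketch of what that setup involves: Glue turns on Spark event logging through two job parameters, --enable-spark-ui and --spark-event-logs-path. The helper name and the S3 prefix below are made up for illustration; the resulting dict would be supplied as the job's parameters (for example in the console's job parameters section).

```python
def spark_ui_job_arguments(event_log_path):
    """Build the Glue job parameters that enable Spark event logging.

    The helper itself is hypothetical; only the two parameter names
    are Glue's documented special job parameters.
    """
    if not event_log_path.startswith("s3://"):
        raise ValueError("event log path must be an S3 URI")
    return {
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": event_log_path,
    }


# Assumed bucket and prefix, replace with a location your job role can write to.
args = spark_ui_job_arguments("s3://bucketname/spark-event-logs/")
print(args)
```

Once the job has run with these parameters, the event logs at that path can be loaded into a Spark history server, where the earliest failed stage shows the original exception rather than the FileAlreadyExistsException raised by the retried tasks.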