versions: emr-5.33.1, PySpark 2.4.7
I am trying to iteratively read a subset of the data, transform it, and then save it to the same bucket. The path looks something like this: {bucket_uri}/folder1/date=20220101/
Writing works when there is no existing date=20220101 partition folder; otherwise it fails with
pyspark.sql.utils.AnalysisException: 'path ... already exists'
My code looks something like this:
output_path = 'bucket_uri/folder1/date=20220101'
for i in range(0, 100, 10):
    pdf = spark.read.parquet(*file_list[i:i + 10])
    # ... doing transformations ...
    pdf_transformed.write.parquet(output_path)
I could add an extra layer by writing the PySpark DataFrame in each iteration to a different folder, bucket_uri/folder1/date=20220101/iteration{i},
but I want to keep all the parquet files in one bucket.
CodePudding user response:
You need to specify the write mode, either append or overwrite, when writing the DataFrame to S3. Append mode keeps the existing data and adds the new data to the same folder, whereas overwrite removes the existing data before writing the new data. So it boils down to whether you want to keep the existing data in the output path or not.
pdf_transformed.write.mode("append").parquet(output_path) #if you want to append data
pdf_transformed.write.mode("overwrite").parquet(output_path) #if you want to overwrite the data in the output path
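Putting it together with the loop from the question (a minimal sketch; bucket_uri and file_list are the placeholders from the question, and the transformation step is only assumed here), append mode lets every iteration add its parquet files to the same date=20220101 folder:

output_path = 'bucket_uri/folder1/date=20220101'
for i in range(0, 100, 10):
    # spark.read.parquet accepts multiple paths, so unpack the sublist
    pdf = spark.read.parquet(*file_list[i:i + 10])

    # ... your transformations here, producing pdf_transformed ...
    pdf_transformed = pdf  # placeholder for the actual transformations

    # "append" keeps the files written by previous iterations in the folder;
    # use "overwrite" instead if the folder should be replaced each run
    pdf_transformed.write.mode("append").parquet(output_path)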