Is there any other way to write to an S3 bucket using PySpark without creating multiple partition files?

Time:08-24

I have a data frame df1 with n columns, and among those n columns we have two columns called (storeid, Date). I am trying to take a subset of df1 with those two columns, grouping by storeid and taking max(Date), as below:

from pyspark.sql.functions import col, max

df2 = (df1.select(['storeid', 'Date'])
          .withColumn("Date", col("Date").cast("timestamp"))
          .groupBy("storeid")
          .agg(max("Date")))

After that I rename the column max(Date), since the write to S3 will not allow brackets in the column name:

df2=df2.withColumnRenamed('max(Date)', 'max_date')
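(As a side note, a minimal alternative sketch using the same column names as above: the rename step can be avoided by aliasing the aggregate directly.)

# Alias the aggregate up front so no withColumnRenamed call is needed afterwards
df2 = (df1.select(['storeid', 'Date'])
          .withColumn("Date", col("Date").cast("timestamp"))
          .groupBy("storeid")
          .agg(max("Date").alias("max_date")))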

Now I am trying to write this to the S3 bucket, and it is creating a parquet file for each row: for example, if df2 has 10 rows, it creates 10 parquet files instead of 1 parquet file.

df2.write.mode("overwrite").parquet('s3://path')

Can anyone please help me with this? I need df2 to be written as a single parquet file instead of many, with all the data in it in table format.

CodePudding user response:

If the DataFrame is not too big, you can try to repartition it to a single partition before writing:

df2.repartition(1).write.mode("overwrite").parquet('s3://path')
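An alternative sketch, assuming the same DataFrame and path as above: coalesce(1) also collapses the output to a single file, and it merges the existing partitions without the full shuffle that repartition(1) triggers.

# coalesce(1) reduces the DataFrame to one partition without a full shuffle,
# so the write produces a single parquet file under the given prefix
df2.coalesce(1).write.mode("overwrite").parquet('s3://path')

Either way, keep in mind that writing through a single partition means one executor handles the whole output, which is only practical when the DataFrame is small.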