I have a performance problem with the repartition and partitionBy operations in Spark. My df contains one month of data, and I am partitioning it by day using the dailyDt column. My code is below.
First attempt
This finishes in 3 minutes, but it produces many small files for each dailyDt partition.
import org.apache.spark.sql.SaveMode.Overwrite

df.repartition(600)
.write
.partitionBy("dailyDt")
.mode(Overwrite)
.parquet("/path..")
Second attempt
This produces only one big file for each day, so it's not the solution.
df.repartition(20, $"dailyDt")
.write
.partitionBy("dailyDt")
.mode(Overwrite)
.parquet("/path..")
Current solution
Adding a salt with the rand function, I get 20 files of roughly equal size for each day (as expected), but it takes too long to run.
import org.apache.spark.sql.functions.rand
df.repartition(20, $"dailyDt", rand)
.write
.partitionBy("dailyDt")
.mode(Overwrite)
.parquet("/path..")
So I have a solution, but it runs for a long time. How can I decrease the execution time?
CodePudding user response:
I fixed it with repartitionByRange.
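The answer does not include the call itself, so here is a minimal sketch of what a repartitionByRange-based write could look like. The partition count (600) and the extra high-cardinality column (a hypothetical eventTs timestamp) used to spread each day across several range partitions are assumptions, not part of the original answer.

import org.apache.spark.sql.SaveMode.Overwrite

// Range-partition by day plus a hypothetical high-cardinality column so each
// day's rows land in a few contiguous shuffle partitions rather than one.
df.repartitionByRange(600, $"dailyDt", $"eventTs")
.write
.partitionBy("dailyDt")
.mode(Overwrite)
.parquet("/path..")

Because range partitioning keeps the rows for a given dailyDt in contiguous shuffle partitions, each write task typically writes to only one or two day directories. That tends to avoid both the small-file explosion of a plain repartition(600) and the cost of scattering every day's rows across all partitions, as happens when salting with rand.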