How to Increase Spark Repartition With Column Expressions Performance


I have a performance problem with the repartition and partitionBy operations in Spark.

My DataFrame contains monthly data, and I am partitioning it by day using the dailyDt column. My code is shown below.

First attempt

This finishes in 3 minutes, but it produces many small files for each dailyDt partition.

df.repartition(600)
  .write
  .partitionBy("dailyDt")
  .mode(Overwrite)
  .parquet("/path..") 

Second attempt

This produces only one big file for each day, so it is not the solution.

df.repartition(20, $"dailyDt")
  .write
  .partitionBy("dailyDt")
  .mode(Overwrite)
  .parquet("/path..") 

Current solution

Adding a salt with the rand function gives 20 files per day of roughly equal size (as expected), but it takes too much time to run.

import org.apache.spark.sql.functions.rand
        
df.repartition(20, $"dailyDt", rand)
  .write
  .partitionBy("dailyDt")
  .mode(Overwrite)
  .parquet("/path..")

So I have a working solution, but it runs for a long time. How can I decrease the execution time?

CodePudding user response:

Fixed it with repartitionByRange.
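
For reference, here is a minimal sketch of what a repartitionByRange version might look like, assuming the same df and dailyDt column as above and a target of about 20 files per day; the rand() tie-breaker column and the partition count of 600 are assumptions, not details given in the answer.

import org.apache.spark.sql.SaveMode.Overwrite
import org.apache.spark.sql.functions.rand

// Range-partition on (dailyDt, rand()) into 600 partitions: rows are ordered by
// dailyDt first, so each day's data lands in a contiguous block of roughly 20
// partitions (assuming ~30 days in the month), and partitionBy("dailyDt") then
// writes about 20 similarly sized files per day.
df.repartitionByRange(600, $"dailyDt", rand())
  .write
  .partitionBy("dailyDt")
  .mode(Overwrite)
  .parquet("/path..")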
