Spark coalesce changing the order of unionAll

Time:02-11

I have two DataFrames that I am trying to combine with unionAll.

DF3=DF1.unionAll(DF2)

DF3.coalesce(1).write.csv("/location")

After coalesce, DF1 is always placed below DF2. The reason appears to be that the smaller partitions come last, as explained here: https://stackoverflow.com/a/59838761/3357735 .

Is there any way to keep the same order as my union after coalesce, that is, DF1 first and DF2 after it?

CodePudding user response:

Did you try using row_number before coalesce?

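from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

# Number the rows of the union so the order can be restored after coalesce.
# A window ordered by a constant numbers the rows in whatever order Spark produces them.
DF3 = DF1.unionAll(DF2)\
    .withColumn("p", F.row_number().over(Window.orderBy(F.lit(None))))\
    .coalesce(1)\
    .orderBy(F.col("p"))\
    .drop("p")

# Write the reordered result; the helper column "p" is already dropped.
DF3.write.csv("/location")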
  1. Create a new column called "p" and assign an incremental number to each row using row_number over a window ordered by None (you can order by an existing column instead if you want that column to define the order).
  2. After coalesce, sort the rows by "p" to restore the initial order.
  3. Drop the column "p" before writing the CSV.