I have two DataFrames that I am trying to combine with unionAll:
DF3=DF1.unionAll(DF2)
DF3.coalesce(1).write.csv("/location")
After the coalesce, DF1 always ends up below DF2 in the output. The reason appears to be that the smaller partitions come last, as described here: https://stackoverflow.com/a/59838761/3357735 .
Is there any way to keep the same order as my union, that is, DF1 first and DF2 second, after the coalesce?
CodePudding user response:
Did you try using row_number before coalesce?
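from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window
# Number the rows before coalescing. Note that a window without
# partitionBy moves all rows into a single partition to compute row_number.
DF3 = DF1.unionAll(DF2)\
    .withColumn("p", F.row_number().over(Window.orderBy(F.lit(None))))\
    .coalesce(1)\
    .orderBy(F.col("p"))\
    .drop("p")
# Write the re-sorted rows as a single CSV part file.
DF3.write.csv("/location")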
- A new column "p" assigns an incremental number to each row, using row_number over a window ordered by None. (To order by a particular column instead, pass that column to Window.orderBy.)
- After the coalesce, the rows are sorted back into their initial order by "p".
- The column "p" is dropped before the CSV is written.