Applying PySpark dropDuplicates method messes up the sorting of the data frame

Time:11-10

I'm not sure why this happens, but when I apply dropDuplicates to a sorted data frame, the sort order is lost. Compare the following two tables.

The following table is the output of sorted_df.show(); the rows are in sorted order.

+----------+-----------+
|sorted_col|another_col|
+----------+-----------+
|         1|          1|
|         8|          5|
|        15|          1|
|        19|          9|
|        20|          7|
|        27|          9|
|        67|          8|
|        91|          9|
|        91|          7|
|        91|          1|
+----------+-----------+

The following table is the output of sorted_df.dropDuplicates().show(). The rows are no longer sorted, even though it is the same data frame.

+----------+-----------+
|sorted_col|another_col|
+----------+-----------+
|        27|          9|
|        67|          8|
|        15|          1|
|        91|          7|
|         1|          1|
|        91|          1|
|         8|          5|
|        91|          9|
|        20|          7|
|        19|          9|
+----------+-----------+

Can someone explain why this happens, and how I can keep the same sort order with dropDuplicates applied?

Apache Spark version 3.1.2

CodePudding user response:

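dropDuplicates involves a shuffle: rows are redistributed across partitions by a hash of the compared columns so that identical rows end up together. The ordering produced by the earlier sort is therefore not preserved. To get sorted output, apply the sort after dropDuplicates rather than before it.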
