I'm not sure why this is the behaviour, but when I apply dropDuplicates to a sorted data frame, the sorting order is disrupted. Compare the following two tables.
The following table is the output of sorted_df.show(), in which the rows are in sorted order.
+----------+-----------+
|sorted_col|another_col|
+----------+-----------+
|         1|          1|
|         8|          5|
|        15|          1|
|        19|          9|
|        20|          7|
|        27|          9|
|        67|          8|
|        91|          9|
|        91|          7|
|        91|          1|
+----------+-----------+
The following table is the output of sorted_df.dropDuplicates().show(), and the rows are no longer in sorted order, even though it's the same data frame.
+----------+-----------+
|sorted_col|another_col|
+----------+-----------+
|        27|          9|
|        67|          8|
|        15|          1|
|        91|          7|
|         1|          1|
|        91|          1|
|         8|          5|
|        91|          9|
|        20|          7|
|        19|          9|
+----------+-----------+
Can someone explain why this happens, and how I can keep the same sort order with dropDuplicates applied?
Apache Spark version 3.1.2
CodePudding user response:
dropDuplicates involves a shuffle, so the ordering is disrupted.
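As a rough illustration of why a shuffle scrambles the rows, here is a toy pure-Python model. It is not Spark's actual implementation: the function name shuffle_dedup, the partition count of 4, and the one repeated row added to the question's data are all assumptions made up for this sketch. Each row is routed to a partition by its hash, duplicates are dropped inside each partition, and the partitions are read back one after another, so the result follows partition order rather than the original sort order. Sorting again afterwards restores it.

```python
# Toy model of a hash-partitioned shuffle; NOT how Spark is implemented internally.

# Rows from the question, already sorted by the first column, plus one
# repeated row (8, 5) added here (hypothetical) so there is something to drop.
sorted_rows = [(1, 1), (8, 5), (8, 5), (15, 1), (19, 9), (20, 7),
               (27, 9), (67, 8), (91, 9), (91, 7), (91, 1)]


def shuffle_dedup(rows, num_partitions=4):
    """Send each row to a partition by hash, dedupe per partition, then concatenate."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Identical rows hash identically, so they always meet in the same partition.
        bucket = partitions[hash(row) % num_partitions]
        if row not in bucket:
            bucket.append(row)
    # Reading the partitions back in partition order ignores the original sort.
    return [row for bucket in partitions for row in bucket]


deduped = shuffle_dedup(sorted_rows)

# Re-sort AFTER deduplicating to get a sorted result back.
resorted = sorted(deduped, key=lambda row: row[0])

print(deduped)
print(resorted)
```

In Spark itself the same idea would be to sort after deduplicating, for example sorted_df.dropDuplicates().orderBy("sorted_col"). Note that ordering by sorted_col alone does not pin down the relative order of the three rows with value 91; if that matters, another_col (or some other tie-breaker) would need to be part of the sort.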