I have a PySpark DataFrame with two columns like so:
+-------+------+
| right | left |
+-------+------+
|     1 |    2 |
|     2 |    3 |
|     2 |    1 |
|     3 |    2 |
|     1 |    1 |
+-------+------+
I want to drop duplicates irrespective of the order of the columns.
For example, a row containing (1, 2) and a row containing (2, 1) are duplicates.
The resulting DataFrame would look like this:
+-------+------+
| right | left |
+-------+------+
|     1 |    2 |
|     2 |    3 |
|     1 |    1 |
+-------+------+
The regular drop_duplicates method doesn't work in this case. Does anyone have any ideas on how to do this cleanly and efficiently?
CodePudding user response:
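You can build an order-insensitive key by sorting the two values into an array column, deduplicating on that key, and then dropping the helper column: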
from pyspark.sql.functions import array, array_sort, col

(df1
 .withColumn('x', array_sort(array(col('left'), col('right'))))  # sorted array of the two values, so (1, 2) and (2, 1) map to the same key
 .dropDuplicates(['x'])  # drop duplicates on the order-insensitive key
 .drop('x')              # remove the helper column
 ).show()
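An alternative along the same lines, assuming the columns hold comparable values such as the integers shown above, is to normalize each pair with least/greatest instead of building an array. This is a minimal sketch; the inline df1 just reproduces the sample data from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import least, greatest

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(
    [(1, 2), (2, 3), (2, 1), (3, 2), (1, 1)],
    ['right', 'left'],
)

# Put the smaller value in 'lo' and the larger in 'hi', so (1, 2)
# and (2, 1) normalize to the same pair, then dedupe on that pair.
(df1
 .withColumn('lo', least('left', 'right'))
 .withColumn('hi', greatest('left', 'right'))
 .dropDuplicates(['lo', 'hi'])
 .drop('lo', 'hi')
 ).show()

This avoids creating an array column, at the cost of two helper columns; either version keeps whichever row Spark encounters first for each unordered pair.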