PySpark - drop rows with duplicate values with no column order


I have a PySpark DataFrame with two columns like so:

+-----+----+
|right|left|
+-----+----+
|    1|   2|
|    2|   3|
|    2|   1|
|    3|   2|
|    1|   1|
+-----+----+
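
For reference, a minimal sketch that builds this example DataFrame (the variable name df1 is an assumption, chosen to match the answer below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Recreate the example data; column names match the table above
df1 = spark.createDataFrame(
    [(1, 2), (2, 3), (2, 1), (3, 2), (1, 1)],
    ['right', 'left'],
)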

I want to drop duplicate rows, but without regard to the order of the values across the two columns.
For example, a row containing (1,2) and a row containing (2,1) are duplicates.

The resulting DataFrame would look like this:

+-----+----+
|right|left|
+-----+----+
|    1|   2|
|    2|   3|
|    1|   1|
+-----+----+

The regular drop_duplicates method doesn't work in this case. Does anyone have an idea how to do this cleanly and efficiently?

CodePudding user response:

from pyspark.sql.functions import array, array_sort, col

(df1
 .withColumn('x', array_sort(array(col('left'), col('right'))))  # sorted array: (1,2) and (2,1) both become [1, 2]
 .dropDuplicates(['x'])  # deduplicate on the sorted-array key
 .drop('x')              # drop the helper column
).show()
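
The key idea is that array_sort normalizes each pair, so (1,2) and (2,1) both map to the key [1, 2] and dropDuplicates treats them as the same row. Note that dropDuplicates keeps an arbitrary row per key in distributed execution, so whether the surviving row is (1,2) or (2,1) is not guaranteed without an explicit ordering.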