How to remove duplicates from a DataFrame in Spark based on particular columns?


How do I get a DataFrame that does not contain duplicate rows? (I.e., one containing only unique rows, where "uniqueness" is determined by particular columns.)

For example, given the DataFrame:

1 2 3 4
2 3 4 5
2 3 4 4 

we need to remove all rows whose first three fields are equal:

We see that 2 3 4 = 2 3 4, so the resulting DataFrame would be:

1 2 3 4
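
(Note: dropDuplicates(['id','id1','id2']) is not enough here; it keeps one arbitrary representative of each duplicated key instead of removing all such rows. A minimal sketch of the distinction, assuming a running SparkSession named spark and the column names used in the answers below:)

df = spark.createDataFrame([[1, 2, 3, 4], [2, 3, 4, 5], [2, 3, 4, 4]],
                           ['id', 'id1', 'id2', 'id3'])

# dropDuplicates keeps ONE (arbitrary) row per key, so two rows survive:
df.dropDuplicates(['id', 'id1', 'id2']).show()

# What is asked for instead: drop EVERY row whose key occurs more than
# once, leaving only (1, 2, 3, 4); see the answers below.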

CodePudding user response:

I am using Spark with a left anti join to achieve the desired output.

from pyspark.sql import functions as F

data = [[1,2,3,4],[2,3,4,5],[2,3,4,4]]
schema = ['id','id1','id2','id3']
df = spark.createDataFrame(data, schema)

# concatenate the fields on which we need to remove duplicate records
df2 = df.withColumn("concat", F.concat_ws('', F.col("id"), F.col("id1"), F.col("id2")))

df2.createOrReplaceTempView("temp")
# keys that occur more than once
result = spark.sql("select concat, count(1) from temp group by concat having count(1) > 1")

# a left anti join keeps only the rows whose key is NOT among the duplicated keys
df2.join(result, result.concat == df2.concat, how="left_anti").drop("concat").show()

+---+---+---+---+
| id|id1|id2|id3|
+---+---+---+---+
|  1|  2|  3|  4|
+---+---+---+---+

Let me know if you need further clarification on it.
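
One caveat with the concat_ws('') key above: it can make distinct keys collide (for example, (1, 23) and (12, 3) both become "123"). A concat-free sketch of the same filtering, using a window count over the key columns and assuming the df defined above:

from pyspark.sql import Window
from pyspark.sql import functions as F

# count rows per (id, id1, id2) key and keep only keys that occur once
w = Window.partitionBy('id', 'id1', 'id2')
df.withColumn('cnt', F.count('*').over(w)) \
  .filter(F.col('cnt') == 1) \
  .drop('cnt') \
  .show()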

CodePudding user response:

Adding an alternative approach to Mahesh's answer -

spark.sql("""
select 1 as t1,2 as t2,3 as t3,4 as t4 union all
select 2 as t1,3 as t2,4 as t3,5 as t4  union all
select 2 as t1,3 as t2,4 as t3,4 as t4 
""").createOrReplaceTempView("table1")
spark.sql("""select * from table1 where concat(t1,t2,t3) not in (select concat(t1,t2,t3) as c1 from table1 group by c1 having count(*) >1)""").show()

output -

+---+---+---+---+
| t1| t2| t3| t4|
+---+---+---+---+
|  1|  2|  3|  4|
+---+---+---+---+
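
For completeness, the same "not in the duplicated keys" logic can be written in the DataFrame API without temp views or string concatenation; a sketch, assuming the df from the first answer:

from pyspark.sql import functions as F

keys = ['id', 'id1', 'id2']

# keys that appear more than once
dupes = df.groupBy(keys).count().filter(F.col('count') > 1).select(keys)

# a left anti join removes every row carrying a duplicated key
df.join(dupes, on=keys, how='left_anti').show()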