Spark (Scala) count duplicate values for each column individually in a dataframe


I would like to get the duplicate count for each column in a dataframe individually.

Input

id age zip
1 22 12345
1 32 34561
2 32 12345
3 44 34561
2 45 67890
4 66 67890

The output should contain the count of duplicates for each column

Output:

id age zip
2 1 3

For a single column I am able to do a group by and count: df.groupBy("id").count.filter("count > 1").count()

But how do I do it for all the columns and get the output in the desired format?

Also, is it possible to do it with Spark SQL?

CodePudding user response:

navin, you can use a custom Spark accumulator; that would solve this problem.
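
Since the answer above gives no code, here is a minimal Scala sketch that skips the accumulator and instead generalizes the asker's own groupBy/filter/count snippet to every column. The column names are taken from the question; note this launches one Spark job per column, so it is a sketch rather than an optimized solution:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dup-counts").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 22, 12345), (1, 32, 34561), (2, 32, 12345),
  (3, 44, 34561), (2, 45, 67890), (4, 66, 67890)
).toDF("id", "age", "zip")

// For each column, count how many distinct values occur more than once.
val dupCounts = df.columns.map { c =>
  c -> df.groupBy(c).count().filter("count > 1").count()
}.toMap

// Assemble the one-row output in the question's desired format.
val result = Seq((dupCounts("id"), dupCounts("age"), dupCounts("zip")))
  .toDF("id", "age", "zip")
result.show()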

CodePudding user response:

with
cte_id as (
  select count(*) as id_c from df group by id
),
cte_age as (
  select count(*) as age_c from df group by age
),
cte_zip as (
  select count(*) as zip_c from df group by zip
),
cte_row as (
  select 'id' as col, sum(case when id_c > 1 then 1 else 0 end) as val from cte_id
  union all
  select 'age', sum(case when age_c > 1 then 1 else 0 end) from cte_age
  union all
  select 'zip', sum(case when zip_c > 1 then 1 else 0 end) from cte_zip
)
select * from cte_row
pivot (
  sum(val) for col in ('id', 'age', 'zip')
)

You can't achieve this directly; the query above should serve the purpose. Here cte_id holds how many times each id occurred, and cte_age and cte_zip do the same for age and zip. cte_row then counts how many id, age, or zip values have duplicates (a per-group count greater than 1). Finally, pivot is used to transform

+---+---+
|col|val|
+---+---+
| id|  2|
|age|  1|
|zip|  3|
+---+---+

to

+---+---+---+
| id|age|zip|
+---+---+---+
|  2|  1|  3|
+---+---+---+
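
To run this from Scala, a minimal sketch (here queryText is a placeholder for the full statement above, and the view name df matches the table the query reads from):

// Register the dataframe under the name the SQL above expects.
df.createOrReplaceTempView("df")

// queryText is assumed to hold the full CTE + pivot statement shown above.
spark.sql(queryText).show()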