I would like to get the duplicate count for each column in a dataframe individually.
Input
id | age | zip |
---|---|---|
1 | 22 | 12345 |
1 | 32 | 34561 |
2 | 32 | 12345 |
3 | 44 | 34561 |
2 | 45 | 67890 |
4 | 66 | 67890 |
The output should contain, for each column, the number of values that occur more than once.
Output:
id | age | zip |
---|---|---|
2 | 1 | 3 |
For a single column, I am able to do a group by and count:
df.groupBy("id").count.filter("count > 1").count()
But how do I do it for all the columns and get the output in the desired format?
Also, is it possible to do it with Spark SQL?
CodePudding user response:
navin
You can use a custom Spark accumulator; that will solve this problem.
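For reference, a minimal sketch of what the accumulator idea could look like in Scala. It assumes a SparkSession named spark and the input DataFrame named df; CountByValueAccumulator is a made-up name for illustration, and null handling is omitted. Every distinct value ends up in driver memory, so this only suits columns with modest cardinality.
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Counts how many times each distinct value has been seen.
class CountByValueAccumulator extends AccumulatorV2[String, mutable.Map[String, Long]] {
  private val counts = mutable.Map.empty[String, Long]
  override def isZero: Boolean = counts.isEmpty
  override def copy(): CountByValueAccumulator = {
    val acc = new CountByValueAccumulator
    acc.counts ++= counts
    acc
  }
  override def reset(): Unit = counts.clear()
  override def add(v: String): Unit = { counts(v) = counts.getOrElse(v, 0L) + 1L }
  override def merge(other: AccumulatorV2[String, mutable.Map[String, Long]]): Unit =
    other.value.foreach { case (k, n) => counts(k) = counts.getOrElse(k, 0L) + n }
  override def value: mutable.Map[String, Long] = counts
}

// One accumulator per column, all updated in a single pass over the rows.
val cols = Seq("id", "age", "zip")
val accs = cols.map { c =>
  val acc = new CountByValueAccumulator
  spark.sparkContext.register(acc, c)
  c -> acc
}.toMap

df.foreach { row =>
  cols.foreach(c => accs(c).add(row.getAs[Any](c).toString)) // assumes no nulls
}

// A column's duplicate count = number of distinct values that occur more than once.
val duplicateCounts = cols.map(c => c -> accs(c).value.count { case (_, n) => n > 1 })
// e.g. List((id,2), (age,1), (zip,3))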
CodePudding user response:
with cte_id as (
select count(*) as id_c from df group by id
),
cte_age as (
select count(*) as age_c from df group by age
),
cte_zip as (
select count(*) as zip_c from df group by zip
),
cte_row as (
select 'id' as col, sum(case when id_c > 1 then 1 else 0 end) as val from cte_id
union all
select 'age', sum(case when age_c > 1 then 1 else 0 end) from cte_age
union all
select 'zip', sum(case when zip_c > 1 then 1 else 0 end) from cte_zip
)
SELECT * FROM cte_row
pivot(
sum(val) for col in ('id', 'age', 'zip')
)
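This assumes df is available as a table or temporary view; starting from a DataFrame you would register it first and submit the text through spark.sql, for example (duplicateCountQuery is just a placeholder variable holding the SQL text above):
df.createOrReplaceTempView("df")
spark.sql(duplicateCountQuery).show()  // duplicateCountQuery = the SQL text above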
You can't get this directly from a single built-in call, but the query above serves the purpose. Here cte_id holds how many times each id occurs, and cte_age and cte_zip do the same for age and zip. cte_row then holds, per column, how many values occur more than once. pivot is used to transform
+---+---+
|col|val|
+---+---+
| id|  2|
|age|  1|
|zip|  3|
+---+---+
to
+---+---+---+
| id|age|zip|
+---+---+---+
|  2|  1|  3|
+---+---+---+
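If you would rather stay in the DataFrame API than write SQL, a sketch along the lines of the single-column snippet in the question, looped over every column (one Spark job per column, assuming a SparkSession named spark):
import org.apache.spark.sql.functions.lit

// For each column, count how many distinct values occur more than once,
// then assemble the results into a single one-row DataFrame.
val countsPerColumn = df.columns.map { c =>
  val dupes = df.groupBy(c).count().filter("count > 1").count()
  lit(dupes).alias(c)
}
spark.range(1).select(countsPerColumn: _*).show()
// +---+---+---+
// | id|age|zip|
// +---+---+---+
// |  2|  1|  3|
// +---+---+---+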