I would like to get the duplicate count for each column in a dataframe individually.
Input
id | age | zip |
---|---|---|
1 | 22 | 12345 |
1 | 32 | 34561 |
2 | 32 | 12345 |
3 | 44 | 34561 |
2 | 45 | 67890 |
4 | 66 | 67890 |
The output should contain, for each column, the number of values that occur more than once.
Output:
id | age | zip |
---|---|---|
2 | 1 | 3 |
For a single column, I am able to do a group by and count:
df.groupBy("id").count.filter("count > 1").count()
But how do I do it for all the columns and get the output in the desired format?
Also, is it possible to do it with Spark SQL?
CodePudding user response:
navin
You can use a custom Spark accumulator; that will solve this problem.
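For reference, a minimal sketch of what the accumulator idea could look like in Scala. It assumes a SparkSession named spark and the input DataFrame named df; CountByValueAccumulator is a made-up name for illustration, and null handling is omitted. Every distinct value ends up in driver memory, so this only suits columns with modest cardinality.
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Counts how many times each distinct value has been seen.
class CountByValueAccumulator extends AccumulatorV2[String, mutable.Map[String, Long]] {
  private val counts = mutable.Map.empty[String, Long]
  override def isZero: Boolean = counts.isEmpty
  override def copy(): CountByValueAccumulator = {
    val acc = new CountByValueAccumulator
    acc.counts ++= counts
    acc
  }
  override def reset(): Unit = counts.clear()
  override def add(v: String): Unit = { counts(v) = counts.getOrElse(v, 0L) + 1L }
  override def merge(other: AccumulatorV2[String, mutable.Map[String, Long]]): Unit =
    other.value.foreach { case (k, n) => counts(k) = counts.getOrElse(k, 0L) + n }
  override def value: mutable.Map[String, Long] = counts
}

// One accumulator per column, all updated in a single pass over the rows.
val cols = Seq("id", "age", "zip")
val accs = cols.map { c =>
  val acc = new CountByValueAccumulator
  spark.sparkContext.register(acc, c)
  c -> acc
}.toMap

df.foreach { row =>
  cols.foreach(c => accs(c).add(row.getAs[Any](c).toString)) // assumes no nulls
}

// A column's duplicate count = number of distinct values that occur more than once.
val duplicateCounts = cols.map(c => c -> accs(c).value.count { case (_, n) => n > 1 })
// e.g. List((id,2), (age,1), (zip,3))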
CodePudding user response:
with cte_id as (
select count(*) as id_c from df group by id
),
cte_age as (
select count(*) as age_c from df group by age
),
cte_zip as (
select count(*) as zip_c from df group by zip
),
cte_row as (
select 'id' as col, sum(case when id_c > 1 then 1 else 0 end) as val from cte_id
union all
select 'age', sum(case when age_c > 1 then 1 else 0 end) from cte_age
union all
select 'zip', sum(case when zip_c > 1 then 1 else 0 end) from cte_zip
)
SELECT * FROM cte_row
pivot(
sum(val) for col in ('id', 'age', 'zip')
)
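This assumes df is available as a table or temporary view; starting from a DataFrame you would register it first and submit the text through spark.sql, for example (duplicateCountQuery is just a placeholder variable holding the SQL text above):
df.createOrReplaceTempView("df")
spark.sql(duplicateCountQuery).show()  // duplicateCountQuery = the SQL text above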
You can't get this directly from a single built-in call, but the query above serves the purpose. Here cte_id holds how many times each id occurs, and cte_age and cte_zip do the same for age and zip. cte_row then holds, per column, how many values occur more than once. pivot is used to transform
+---+---+
|col|val|
+---+---+
| id|  2|
|age|  1|
|zip|  3|
+---+---+
to
+---+---+---+
| id|age|zip|
+---+---+---+
|  2|  1|  3|
+---+---+---+
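If you would rather stay in the DataFrame API than write SQL, a sketch along the lines of the single-column snippet in the question, looped over every column (one Spark job per column, assuming a SparkSession named spark):
import org.apache.spark.sql.functions.lit

// For each column, count how many distinct values occur more than once,
// then assemble the results into a single one-row DataFrame.
val countsPerColumn = df.columns.map { c =>
  val dupes = df.groupBy(c).count().filter("count > 1").count()
  lit(dupes).alias(c)
}
spark.range(1).select(countsPerColumn: _*).show()
// +---+---+---+
// | id|age|zip|
// +---+---+---+
// |  2|  1|  3|
// +---+---+---+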