Home > database >  How to get distinct count over multiple columns in Hive SQL?
How to get distinct count over multiple columns in Hive SQL?

Time:08-29

I have a table that looks like this. And I want to get the distinct count across the three columns.

ID Column1 Column 2 Column 3
1 A B C
2 A A B
3 A A

The desired output I'm looking for is:

ID Column1 Column 2 Column 3 unique_count
1 A B C 3
2 A A B 2
3 A A 1

CodePudding user response:

1  
case when C1 not in (C2, C3) then 1 else 0 end  
case when C2 not in (C3)     then 1 else 0 end

This will not work if you intend to count nulls. The pattern would extend to more columns by successively comparing each one to all columns to its right. The order doesn't strictly matter. There's just no point in repeating the same test over and over.

If they were alphabetically ordered them you could test only adjacent pairs to look for differences. While that applies to your limited sample it would not be the most general case.

CodePudding user response:

One possible option would be

WITH sample AS (
  SELECT 'A' Column1, 'B' Column2, 'C' Column3 UNION ALL
  SELECT 'A', 'A', 'B' UNION ALL
  SELECT 'A', 'A', NULL
)
SELECT Column1, Column2, Column3, COUNT(DISTINCT c) unique_count
  FROM (SELECT *, ROW_NUMBER() OVER () rn FROM sample) t LATERAL VIEW EXPLODE(ARRAY(Column1, Column2, Column3)) tf AS c
 GROUP BY Column1, Column2, Column3, rn;
output
 --------- --------- --------- -------------- 
| column1 | column2 | column3 | unique_count |
 --------- --------- --------- -------------- 
| A       | A       | NULL    |            1 |
| A       | A       | B       |            2 |
| A       | B       | C       |            3 |
 --------- --------- --------- -------------- 
  • Related