SQL count(distinct) from both the table-CodePudding

I have 2 tables. Let's say Table A and Table B. Table A has a column called "name". Table B also has a column "name". I want to find out the count(distinct name). Name should take values from both the columns.

For ex-

Table A

Table B

name   
A   
B   
D

Output should be 4.

CodePudding user response：

The best concept is, first combine the data in the way you want using a subquery, and then dedupe or do the 2nd step.

For example,

WITH COMBINED AS (
  SELECT 
    name 
  FROM 
    TableA 

  UNION ALL 

  SELECT 
    name 
  FROM 
    TableB
) 
SELECT 
  DISTINCT name 
FROM 
  COMBINED

In your situation, the 2nd step can be accomplished by changing UNION ALL to a UNION. This will dedupe the values automatically. You won't even need a subquery or a 2nd step. But I wanted to teach you the concept because it comes up often.

SELECT name FROM TableA

UNION

SELECT name FROM TableB

CodePudding user response：

Then UNION in the CTE will reove all Duplicates

so a COUNT(*) will suffoce

WITH CTE AS (
    SELECT name FROM TableA
       UNION
     SELECT name FROM TableB
)
SELECT COUNT(*) FROM CTE

CodePudding user response：

I hope this query should do it:

SELECT SUM(names) AS total_names FROM ( SELECT COUNT(DISTINCT(name)) as names FROM TableA UNION SELECT COUNT(DISTINCT(name)) as names FROM TableB ) t;

Note: Tested with sql server

CodePudding user response：

Yet another option:

select hll_count.merge(hll_sketch) names 
from (
  select hll_count.init(name) hll_sketch from tableA
  union all
  select hll_count.init(name) from tableB
)

HLL functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical error. This makes HLL functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.

See more about benefits of using HyperLogLog functions