I have 2 tables. Let's say Table A and Table B. Table A has a column called "name". Table B also has a column "name". I want to find out the count(distinct name). Name should take values from both the columns.
For ex-
Table A
name
A
B
C
Table B
name
A
B
D
Output should be 4.
CodePudding user response:
The best concept is, first combine the data in the way you want using a subquery, and then dedupe or do the 2nd step.
For example,
WITH COMBINED AS (
SELECT
name
FROM
TableA
UNION ALL
SELECT
name
FROM
TableB
)
SELECT
DISTINCT name
FROM
COMBINED
In your situation, the 2nd step can be accomplished by changing UNION ALL
to a UNION
. This will dedupe the values automatically. You won't even need a subquery or a 2nd step. But I wanted to teach you the concept because it comes up often.
SELECT name FROM TableA
UNION
SELECT name FROM TableB
CodePudding user response:
Then UNION in the CTE will reove all Duplicates
so a COUNT(*) will suffoce
WITH CTE AS (
SELECT name FROM TableA
UNION
SELECT name FROM TableB
)
SELECT COUNT(*) FROM CTE
CodePudding user response:
I hope this query should do it:
SELECT SUM(names) AS total_names FROM ( SELECT COUNT(DISTINCT(name)) as names FROM TableA UNION SELECT COUNT(DISTINCT(name)) as names FROM TableB ) t;
Note: Tested with sql server
CodePudding user response:
Yet another option:
select hll_count.merge(hll_sketch) names
from (
select hll_count.init(name) hll_sketch from tableA
union all
select hll_count.init(name) from tableB
)
HLL functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical error. This makes HLL functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
See more about benefits of using HyperLogLog functions