I have 2 tables. t1.uuid
has 630,000 distinct values. t2.uuid
has 300,000 distinct values.
When I run
SELECT
t1.uuid
, t2.uuid
FROM
t1 --630,000 uuids
LEFT OUTER JOIN t2 -- 300,000 uuids
ON t1.uuid = t2.uuid
WHERE
t2.uuid IS NULL
There are no results.
Edit1: For clarification: t1 looks like:
uuid |
---|
ufo123 |
abc456 |
def789 |
t2 looks like
uuid |
---|
ufo123 |
def789 |
Every record in t2 has a match in t1. Not every record in t1 has a match in t2 (as indicated by the size of the tables.
I expect the join to result in:
t1.uuid | t2.uuid |
---|---|
ufo123 | ufo123 |
abc456 | NULL |
def789 | def789 |
And the result of my query to by:
t1.uuid | t2.uuid |
---|---|
abc456 | NULL |
But instead I get no results. I'm not sure if there is a communication issue between Redshift and DBEAVER to produce this behavior.
Edit2: I ran the following which resulted in only 300,000 records (same as only t2):
SELECT
COUNT(DISTINCT t1.uuid)
FROM
t1 --630,000 uuids
FULL JOIN t2 -- 300,000 uuids
ON t1.uuid = t2.uuid
This is completely wrong.
CodePudding user response:
Because you placed the predicate t2.uuid IS NULL
in the WHERE
clause, rather than in the ON
clause. In the place where it is, this predicate silently converts the outer join into an inner join.
To make the join work as an outer join, move the predicate to the ON
clause, as in:
SELECT
t1.uuid
, t2.uuid
FROM
t1 --630,000 uuids
LEFT OUTER JOIN t2 -- 300,000 uuids
ON t1.uuid = t2.uuid
AND t2.uuid IS NULL
CodePudding user response:
Your logic looks fine. Testing on a sample data. 3 records in table a
and 1 record in table b
. Output is 2 records that don't match.
WITH t1 AS (SELECT 1 AS col UNION SELECT 2 UNION SELECT 3)
, t2 AS (SELECT 1 AS col)
SELECT *
FROM t1
LEFT JOIN t2 ON t1.col = t2.col
where t2.col is null
Output
------ -------
|t1.col|t2.col |
------ -------
|2 |NULL |
|3 |NULL |
------ -------
I'd recommend checking the data that it's matching on. As Pelayo Martinez mentioned, it's highly likely your LEFT table is T2 instead of T1. Only then it'll be a no row result.
CodePudding user response:
The problem is that you cannot check NULL with equal sign in Redshift. "NULL = NULL" is false.
You need to expand your JOIN ON clause (I left it general case which isn't needed with you WHERE clause):
SELECT
t1.uuid
, t2.uuid
FROM
t1 --630,000 uuids
LEFT OUTER JOIN t2 -- 300,000 uuids
ON (t1.uuid = t2.uuid) OR (t1.uuid IS NULL AND t2.uuid IS NULL)
WHERE
t2.uuid IS NULL
The problem is that when both t1 and t2 have NULL uuid these will all join to each other and could explode your results. So you may want to check this in a different way if there are a lot of NULLs.
As for an explanation let me address the "why" of this. This has to do with the Redshift inferring a WHERE clause from your SQL. It goes like this - if t2.uuid = X and t1.uuid = t2.uuid then Redshift should be able to apply the WHERE clause t1.uuid = X. But this produces the null set give the reasons above. Redshift does this to reduce the scanned data as much as possible.
Update:
Note that you are recreating the EXCEPT clause with this LEFT OUTER JOIN. You will likely get better results with EXCEPT.