I am struggling to debug a SQL join. I could brute-force it, but I am sure there is tooling for this.
I have 5 tables of that should "just" left join into each other on the common column names.
Something like
SELECT *
FROM users u -- user_id's are unique
JOIN user_groups ug ON ug.user_id = u.user_id
JOIN group_importance gi ON gi.group_id = ug.group_id
However, I'm finding about 10% too many rows in the final output (which I expected to be of the same length as users
)
I implemented the above using
SELECT *
FROM users
NATURAL JOIN user_groups
NATURAL JOIN group_importance
But I can't easily tell the error.
Note that I've checked for duplicate joining keys, and for spurious column overlaps.
Note that I'm using Snowflake/DBT.
CodePudding user response:
Either your user_groups
table contains more than one entry for some users, or your group_importance
table contains more than one entry for some groups, or both.
JOIN gives back all possible pairs of rows from the tables being joined. That is, it gives back all pairs of rows that match the explicit ON condition, or the implicit one generated by NATURAL JOIN. It's no surprise that FROM a NATURAL JOIN b
sometimes yields more rows than are in a
.
You could find the offending rows like this.
SELECT COUNT(*), user_id
FROM user_groups
GROUP BY user_id
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
CodePudding user response:
Using natural joins
make debugging a pain. I would run the query with explicit joins and add a qualify
at the bottom to just look at the dupes. You can partition by additional columns to look at dupes for different groupings
qualify count(*) over (partition by u.user_id) > 1