SQL query to select nodes with no parent-CodePudding

Suppose I've got a table node with two fields : id, and child_id :

id | name      | child_id  | 
----------------------------
1  | node1     | NULL      | 
2  | node2     | 3         |   
3  | node3     | NULL      |

It means that "node1" has no parent and no children, "node2" has no parent and child "node2", and "node2" has parent "node2" and no children.

Now I want to select all "node" that have no parent.
In the example above I should get rows :

id | name      | child_id  | 
----------------------------
1  | node1     | NULL      | 
2  | node2     | 3         |

How would you implement this query ?

CodePudding user response：

A traditional anti-join will produce the rows you want. For example:

select c.*
from node c
left join node p on c.id = p.child_id
where p.id is null

CodePudding user response：

I had to think about this, because it has the parent-child-relationships in opposite direction compared to traditional tree structures (where a node has a reference to its parent, not its child).

If you want all nodes without parents in your data structure, you want all nodes whose id is not used as a child id in other nodes. Right? (Because such other nodes would be the parents of that node.)

That could result in a query like this:

select *
from   node
where  id not in (select child_id
                  from   node
                  where  child_id is not null)

(Note that you have to be careful with three-valued logic in SQL, which could mess up a NOT IN clause. That's why I included an explicit WHERE-clause in the subquery so that that subquery only includes "valid" rows.)

By the way, the alternative query in the answer of The Impaler (which uses a join instead of a subquery) is quite fine as well. I'm not sure, but that query might even be (marginally) more efficient. However, I guess that a subquery expresses the intention of the main query more clearly, making it somewhat easier to read and to maintain. So I would personally stick with the subquery-query at first and move to the join-query only if that turns out to have a substantial and relevant performance benefit. But that's just my own humble opinion, of course. (And in such a case, I would keep the subquery-query as a comment in the code/script for documentation purposes.)