Home > database >  SQL query to filter on group of related rows
SQL query to filter on group of related rows

Time:11-30

I have a recurring problem in SQL queries, that I haven't been able to solve elegantly, neither in raw SQL or the Django ORM, and now I'm faced with it in EntityFramework as well. It is probably common enough to have its own name, but I don't know it.

Say, I have a simple foreign key relationship between two tables, e.g.

Book 1 <- * Tag

A book has many tags and a tag has one book, i.e. the Tag table has a foreign key to the book table.

Now, I want to find all books that have "Tag1" and "Tag2".

Raw SQL

I can make multiple joins

SELECT * FROM books
JOIN tags t1 on tags.book_id = books.id
JOIN tags t2 on tags.book_id = books.id
WHERE t1.tag = 'Tag1' AND t2.tag = 'Tag2'

Cool, that works, but doesn't really seem performant

Django

In django, I could do something similar

Book.objects.filter(tags__tag="Tag1").filter(tags__tag="Tag1")

Changing filters like that will cause the extra joins, like in the raw SQL version

EntityFramework LINQ

I tried chaining .Where() similar to changing Django's .filter(), but that does not have the same result. It will build a query resembling the following, which will of course return nothing, because there is no row where the tag are two different strings

SELECT * FROM books
JOIN tags t1 on tags.book_id = books.id
WHERE t1.tag = 'Tag1' AND t1.tag = 'Tag2'

Wrapping it up

I suppose I could do an array aggregate to aggregate tags into and array and compare to that, but that seems expensive too, and aggregates and grouping also have impact on the ordering of things, which forces me to do subqueries to get the order I want.

I am by no means an expert in SQL, as you can plainly see, but I guess what I am hoping for is either

  1. A way to mimic the stupid nonsense above in LINQ
  2. An alternative, more elegant approach that will let me do what I need and which works well with any ORM

Extra ramblings

This case where I need to find books that have "all of" a list of tags is the tricky bit... If it was "any of" or "this particular one", then it would be simple.

EDIT: The solution using arrays and overlap

In Postgres, we can do array_agg to aggregate all related tags into an array, like this:

SELECT * FROM books
JOIN tags t1 on tags.book_id = books.id
;

 -------- ------- ------ 
| BookId | Name  | Tag  |
 -------- ------- ------ 
|      1 | BookA | Tag1 |
|      1 | BookA | Tag2 |
|      1 | BookA | Tag3 |
|      2 | BookB | Tag1 |
|      2 | BookB | Tag3 |
 -------- ------- ------ 


SELECT books.BookId, Name, array_agg(t1.tags) as tags
FROM books
JOIN tags t1 on tags.book_id = books.id
GROUP BY BookId
ORDER BY BookId
;

 -------- ------- -------------------- 
| BookId | Name  |        tags        |
 -------- ------- -------------------- 
|      1 | BookA | [Tag1, Tag2, Tag3} |
|      2 | BookB | {Tag1, Tag3}       |
 -------- ------- -------------------- 

With that, I can then use the array "contains" operator to find the row where tag overlaps with the expected set: WHERE tags @> ('Tag1', 'Tag2').

This is also a viable option. It does aggregation instead of excessive joining. Not sure what that would look like with LINQ query though

CodePudding user response:

With group by and having, we can do Tag 1 AND Tag 2:

with cte_tags as (
select book_id
  from tags
 where tag in ('Tag 1', 'Tag 2')
 group by book_id
 having count(*)=2)
select b.id as book_id,
       b.name
  from books b
  join cte_tags t
    on b.id = t.book_id;

EDIT:

  1. If tag in tags for a book_id can be duplicated, use the following:
with cte_tags as (
select book_id
  from tags
 where tag in ('Tag 1', 'Tag 2')
 group by book_id
 having count(distinct tag)=2)
select b.id as book_id,
       b.name
  from books b
  join cte_tags t
    on b.id = t.book_id;
  1. If looking for tag1 OR tag2, or tag1 AND tag2, use the following:
with cte_tags as (
select book_id
  from tags
 where tag in ('Tag 1', 'Tag 2')
 group by book_id
 having count(distinct tag) between 1 and 2)
select b.id as book_id,
       b.name
  from books b
  join cte_tags t
    on b.id = t.book_id;

CodePudding user response:

If I understand it correctly you want books that only have tags 'Tag1' and 'Tag2'. I.e. no other tags. I'm not aware of an official name for this problem, maybe exclusive contains.

It amounts to finding books meeting two conditions:

  • have all tags in ("Tag1", "Tag2")
  • have two unique tags

Since you're looking for an Entity-Framework solution, here's the way to do it in LINQ:

var tags = new[] { "Tag1", "Tag2" };
var books = context.Books
    .Where(b => b.Tags.All(t => tags.Contains(t.Tag)) 
        && b.Tags.Select(t.Tag).Distinct().Count == tags.Count());

The second condition is necessary, otherwise books without any tags would also be selected (that's the semantics of All).

  • Related