Home > Software design >  Make JPQL/QueryDSL not generate terrible queries
Make JPQL/QueryDSL not generate terrible queries

Time:09-29

I'm using QueryDSL 4.4.0 with Hibernate 5.4.32 to query a simple Blog platform towards a PostgreSQL database. My problem is that JPQL and by extension QueryDSL insists on generating truly astoundingly bad queries. I wonder if there's a way I can make it not do that. I'd prefer not having to go to native queries as the queries are already being generated.

I have essentially 3 entities:

@Entity
@Table(indexes = { ... })
public class Note {
    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    @NotNull
    private UUID id;

    @ManyToMany(fetch = FetchType.EAGER)
    private List<Keyword> keywords;

    ...
}

@Entity
@Table(indexes = { @Index(name = "keyword_parent", columnList = "parent_id"), ... })
public class Keyword {
    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    @NotNull
    private UUID id;

    @EqualsAndHashCode.Exclude
    @ManyToOne(fetch = FetchType.LAZY)
    private Keyword parent;

    @ManyToMany(fetch = FetchType.LAZY)
    private List<Keyword> implies = new ArrayList<>();

    @OneToMany(fetch = FetchType.LAZY, mappedBy = "parent", orphanRemoval = false)
    private List<Keyword> children = new ArrayList<>();

    ...
}

@Entity
@Table(indexes = { @Index(columnList = "child_id"), @Index(columnList = "parent_id"),
        @Index(columnList = "child_id,parent_id", unique = true), @Index(columnList = "ref"), ... })
@IdClass(KeywordCacheId.class)
@Where(clause = "ref > 0")
public class KeywordCache implements Serializable {
    /**
     * 
     */
    private static final long serialVersionUID = 1L;
    @NotNull
    @ManyToOne(fetch = FetchType.EAGER)
    @Id
    private Keyword child;

    @NotNull
    @ManyToOne(fetch = FetchType.EAGER)
    @Id
    private Keyword parent;

    private int ref;

    ...
}

public class KeywordCacheId implements Serializable {
    /**
     * 
     */
    private static final long serialVersionUID = 1L;
    private UUID child;
    private UUID parent;

    // equals   hashCode
}

(simplified to only contain main structure)

A note has a number of keywords. Keywords have a hierarchical relationship an auxiliary relationship. That relationship is too complex to handle in SQL, so a cache is built reflecting whether two keywords are in a relationship.

I have 7680 notes, 1308 keywords, just shy of 39k note-keyword relationships, 12 keyword-keyword relationships, and a computed cache of 3002 relationships. In other words, a tiny database.

I want to find all notes which contain keywords that are in relationships with a list of given keyword ids.

My first attempt was

    private JPAQuery<Note> addFilter(JPAQuery<Note> query, List<String> filter) {
        for (String f : filter) {
            UUID id = UUID.fromString(f);
            String variable = id.toString().replaceAll("-", "");
            QKeywordCache cache = new QKeywordCache("kc_"   variable);
            query.from(cache);
            query.where(cache.child.in(QNote.note.keywords));
            query.where(cache.parent.id.eq(id));
        }
        return query;
    }

    public Page<Note> find(List<String> filter, Pageable page) {
        JPAQuery<Note> query = new JPAQuery<>(entityManager);
        query.from(QNote.note);
        query.select(QNote.note);
        query.distinct();
        query = addFilter(query, filter);
        query.offset(page.getOffset());
        query.limit(page.getPageSize());
        QueryResults<Note> data = query.fetchResults();
        return new PageImpl<>(data.getResults(), page, data.getTotal());
    }

this results in meaningful JPQL which is translated to pants-on-head-crazy SQL:

select distinct note
from Note note, KeywordCache kc_6205f3b41e354d63909ef253866371b1
where kc_6205f3b41e354d63909ef253866371b1.child member of note.keywords and kc_6205f3b41e354d63909ef253866371b1.parent.id = ?1

select
    count(distinct note0_.id) as col_0_0_
from
    Note note0_
cross join KeywordCache keywordcac1_
where
    ( keywordcac1_.ref > 0)
    and (keywordcac1_.child_id in (
    select
        keywords2_.keywords_id
    from
        Note_Keyword keywords2_
    where
        note0_.id = keywords2_.Note_id))
    and keywordcac1_.parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'

Aggregate  (cost=10229575.18..10229575.19 rows=1 width=8)
  ->  Nested Loop  (cost=4.30..10229542.12 rows=13222 width=16)
        Join Filter: (SubPlan 1)
        ->  Seq Scan on note note0_  (cost=0.00..206.15 rows=8815 width=16)
        ->  Materialize  (cost=4.30..13.31 rows=3 width=16)
              ->  Bitmap Heap Scan on keywordcache keywordcac1_  (cost=4.30..13.29 rows=3 width=16)
                    Recheck Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
                    Filter: (ref > 0)
                    ->  Bitmap Index Scan on idx1in649xpbjw4aeix3574irbne  (cost=0.00..4.30 rows=3 width=0)
                          Index Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
        SubPlan 1
          ->  Seq Scan on note_keyword keywords2_  (cost=0.00..773.59 rows=5 width=16)
                Filter: (note0_.id = note_id)

The count is because of the pagination. That insane "in" construction makes this query take around 150 seconds.

Replacing instead the filter method by

    private JPAQuery<Note> addFilter(JPAQuery<Note> query, List<String> filter) {
        for (String f : filter) {
            UUID id = UUID.fromString(f);
            String variable = id.toString().replaceAll("-", "");
            QKeywordCache cache = new QKeywordCache("kc_"   variable);
            query.from(cache);
            query.where(QNote.note.keywords.any().eq(cache.child));
//          query.where(cache.child.in(QNote.note.keywords));
            query.where(cache.parent.id.eq(id));
        }
        return query;
    }

I get slightly worse JPQL but SQL which looks overly complex due to an unnecessary subselect:

select distinct note
from Note note, KeywordCache kc_6205f3b41e354d63909ef253866371b1
where exists (select 1
from note.keywords as note_keywords_0
where note_keywords_0 = kc_6205f3b41e354d63909ef253866371b1.child) and kc_6205f3b41e354d63909ef253866371b1.parent.id = ?1

select
    count(distinct note0_.id) as col_0_0_
from
    Note note0_
cross join KeywordCache keywordcac1_
where
    ( keywordcac1_.ref > 0)
    and (exists (
    select
        1
    from
        Note_Keyword keywords2_,
        Keyword keyword3_
    where
        note0_.id = keywords2_.Note_id
        and keywords2_.keywords_id = keyword3_.id
        and keyword3_.id = keywordcac1_.child_id))
    and keywordcac1_.parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'

Aggregate  (cost=3212.04..3212.05 rows=1 width=8)
  ->  Hash Semi Join  (cost=1459.43..3211.99 rows=18 width=16)
        Hash Cond: ((note0_.id = keywords2_.note_id) AND (keywordcac1_.child_id = keywords2_.keywords_id))
        ->  Nested Loop  (cost=4.30..550.01 rows=26445 width=32)
              ->  Seq Scan on note note0_  (cost=0.00..206.15 rows=8815 width=16)
              ->  Materialize  (cost=4.30..13.31 rows=3 width=16)
                    ->  Bitmap Heap Scan on keywordcache keywordcac1_  (cost=4.30..13.29 rows=3 width=16)
                          Recheck Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
                          Filter: (ref > 0)
                          ->  Bitmap Index Scan on idx1in649xpbjw4aeix3574irbne  (cost=0.00..4.30 rows=3 width=0)
                                Index Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
        ->  Hash  (cost=871.22..871.22 rows=38927 width=48)
              ->  Hash Join  (cost=92.43..871.22 rows=38927 width=48)
                    Hash Cond: (keywords2_.keywords_id = keyword3_.id)
                    ->  Seq Scan on note_keyword keywords2_  (cost=0.00..676.27 rows=38927 width=32)
                    ->  Hash  (cost=76.08..76.08 rows=1308 width=16)
                          ->  Seq Scan on keyword keyword3_  (cost=0.00..76.08 rows=1308 width=16)

Now, the query can take advantage of my indices and takes around 120ms. It still uses a silly subselect that is entirely unnecessary, though. Writing a query by hand, I get

select
    count(distinct note0_.id) as col_0_0_
from
    Note note0_,
    Note_Keyword keywords2_,
    KeywordCache keywordcac1_
where
    keywordcac1_.ref > 0
    and note0_.id = keywords2_.Note_id
    and keywords2_.keywords_id = keywordcac1_.child_id
    and keywordcac1_.parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'

Aggregate  (cost=816.18..816.19 rows=1 width=8)
  ->  Nested Loop  (cost=13.61..815.99 rows=75 width=16)
        ->  Hash Join  (cost=13.33..792.09 rows=75 width=16)
              Hash Cond: (keywords2_.keywords_id = keywordcac1_.child_id)
              ->  Seq Scan on note_keyword keywords2_  (cost=0.00..676.27 rows=38927 width=32)
              ->  Hash  (cost=13.29..13.29 rows=3 width=16)
                    ->  Bitmap Heap Scan on keywordcache keywordcac1_  (cost=4.30..13.29 rows=3 width=16)
                          Recheck Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
                          Filter: (ref > 0)
                          ->  Bitmap Index Scan on idx1in649xpbjw4aeix3574irbne  (cost=0.00..4.30 rows=3 width=0)
                                Index Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
        ->  Index Only Scan using note_pkey on note note0_  (cost=0.29..0.32 rows=1 width=16)
              Index Cond: (id = keywords2_.note_id)

This query takes 30ms. While the improvement from 120ms to 30ms may seem insignificant, it is still a factor 4 and this is a central loop in my application, so I'd like to keep it fast. Especially since I can add multiple keywords (and expect normal uses to have 3-6 keywords in the list) and plan to also add sorting, so the subselect must be efficient (or absent).

So, is there a way to make QueryDSL generate better JPQL in the second case or make JPQL (as implemented by Hibernate) stop fetishising wild "in" subselects for containers as in the first case?

CodePudding user response:

Thanks to Jan-Willem Gmelig Meyling's comment, I got it working using an explicit join with Keyword:

    private JPAQuery<Note> addFilter(JPAQuery<Note> query, List<String> filter) {
        for (String f : filter) {
            UUID id = UUID.fromString(f);
            String variable = id.toString().replaceAll("-", "");
            QKeywordCache cache = new QKeywordCache("kc_"   variable);
            QKeyword keyword = new QKeyword("k_"   variable);
            query.from(cache);
            query.innerJoin(QNote.note.keywords, keyword);
            query.where(keyword.eq(cache.parent));
            query.where(cache.parent.id.eq(id));
        }
        return query;
    }

It leads to a query closer to what I'd written by hand:

select distinct note
from Note note, KeywordCache kc_6205f3b41e354d63909ef253866371b1
  inner join note.keywords as k_6205f3b41e354d63909ef253866371b1
where k_6205f3b41e354d63909ef253866371b1 = kc_6205f3b41e354d63909ef253866371b1.parent and kc_6205f3b41e354d63909ef253866371b1.parent.id = ?1

select
    count(distinct note0_.id) as col_0_0_
from
    Note note0_
cross join KeywordCache keywordcac1_
inner join Note_Keyword keywords2_ on
    note0_.id = keywords2_.Note_id
inner join Keyword keyword3_ on
    keywords2_.keywords_id = keyword3_.id
where
    ( keywordcac1_.ref > 0)
    and keyword3_.id = keywordcac1_.parent_id
    and keywordcac1_.parent_id ='98c9201c-a395-4ac4-9348-ea89e740653b'

Aggregate  (cost=812.06..812.07 rows=1 width=8)
  ->  Nested Loop  (cost=4.87..812.04 rows=6 width=16)
        ->  Bitmap Heap Scan on keywordcache keywordcac1_  (cost=4.30..13.45 rows=3 width=16)
              Recheck Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
              Filter: (ref > 0)
              ->  Bitmap Index Scan on idx1in649xpbjw4aeix3574irbne  (cost=0.00..4.30 rows=3 width=0)
                    Index Cond: (parent_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
        ->  Materialize  (cost=0.56..798.52 rows=2 width=32)
              ->  Nested Loop  (cost=0.56..798.51 rows=2 width=32)
                    ->  Index Only Scan using keyword_pkey on keyword keyword3_  (cost=0.28..8.29 rows=1 width=16)
                          Index Cond: (id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
                    ->  Nested Loop  (cost=0.29..790.19 rows=2 width=32)
                          ->  Seq Scan on note_keyword keywords2_  (cost=0.00..773.59 rows=2 width=32)
                                Filter: (keywords_id = '98c9201c-a395-4ac4-9348-ea89e740653b'::uuid)
                          ->  Index Only Scan using note_pkey on note note0_  (cost=0.29..8.30 rows=1 width=16)
                                Index Cond: (id = keywords2_.note_id)

The extra join is more than made up for by avoiding the subselect, making performance similar to the hand-written query.

  • Related