Efficiently ordering by the number of matched columns in Postgres?-CodePudding

I have a denormalized table where each column is a tag associated with an object. E.g. this post could have the tags "postgres", "indexing", and "similarity". The table looks something like:

| id | col1     | col2     | col3       |
| 1  | postgres | indexing | similarity |
| 2  | postgres | foo      | bar        |
| 3  | foo      | bar      | baz        |

If I wanted to find the post most similar to this one, I could do something like:

select *
from mytable
order by (col1 = 'postgres' or col2 = 'postgres' or col3 = 'postgres')::int
    (col1 = 'indexing' or col2 = 'indexing' or col3 = 'indexing')::int
    (col1 = 'similarity' or col2 = 'similarity' or col3 = 'similarity')::int desc
limit 10;

However, as far as I know, this can't use any indexes for sorting; it has to do a table scan. When I ran it on a similar table with every column indexed, I got this:

Limit  (cost=35124.06..35125.23 rows=10 width=21) (actual time=204.323..206.350 rows=10 loops=1)
  ->  Gather Merge  (cost=35124.06..132353.15 rows=833334 width=21) (actual time=204.322..206.344 rows=10 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=34124.04..35165.70 rows=416667 width=21) (actual time=194.794..194.795 rows=10 loops=3)
              Sort Key: ((((((col1 = 'aa'::text) OR (col2 = 'aa'::text) OR (col3 = 'aa'::text)))::integer   (((col1 = 'bb'::text) OR (col2 = 'bb'::text) OR (col3 = 'bb'::text)))::integer)   (((col1 = 'cc'::text) OR (col2 = 'cc'::text) OR (col3 = 'cc'::text)))::integer)) DESC
              Sort Method: top-N heapsort  Memory: 25kB
              Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
              Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
              ->  Parallel Seq Scan on "__testTags__"  (cost=0.00..25120.01 rows=416667 width=21) (actual time=0.016..145.436 rows=333333 loops=3)
Planning Time: 0.139 ms
Execution Time: 206.371 ms

Is there a way to order by the number of matching columns more efficiently? I looked into using a vector embedding, but Postgres doesn't have any good vector or nearest-neighbor support. Also, fulltext search doesn't seem to support ranking by the number of matches.

The only solution I can think of is doing 4 separate queries: get rows with 3 matches, then rows with 2 matches, then rows with 1 match, then rows with no matches. This is much faster than the query above. However, if I want to add more columns, the query will get really complicated.

CodePudding user response：

It is possible to use an index for your purpose which relies on an array type :

CREATE INDEX IF NOT EXISTS col_index ON mytable USING gin ((array[col1,col2,col3]) array_ops) ;

Then, the following query can use the index col_index for filtering the rows where one or several colomns match the search criteria and then calculate the number of exact matching :

SELECT m.match_count, t.*
  FROM mytable AS t
 CROSS JOIN LATERAL
     ( SELECT count(*) AS match_count
         FROM unnest(array['postgres', 'indexing', 'similarity']) WITH ORDINALITY AS a(arr, id)
        WHERE (array[t.col1, t.col2, t.col3])[a.id] = a.arr
     ) AS m
 WHERE array[t.col1, t.col2, t.col3] && array['postgres', 'indexing', 'similarity']

see the test result in dbfiddle.

CodePudding user response：

Use a simple inverted index: Create and maintain (probably using triggers - an exercise for the reader), a table that stores the tag and the id of the source data, eg:

create table tag_mytable (
    tag text not null,
    mytable_id int not null references mytable,
    primary key (tag, mytable_id)
)

You can then very efficiently find mytable row tag hit count:

with hits as (
    select mytable_id
    from tag_mytable
    where tag = $1
    union all
    select mytable_id
    from tag_mytable
    where tag = $2
    union all
    select mytable_id
    from tag_mytable
    where tag = $3
), hits_total as (
    select mytable_id, count(*) as tag_hits
    from hits
    group by 1
)
select t.*
from mytable t
left join hits_total h on h.mytable_id = t.mytable_id
order by h.tag_hits

This has the added advantage of assigning a hit count of zero to rows without any hits (presumably most of them) without having to compare their data.

Consider bypassing storing the tags in mytable and instead store them only in this new table, which eliminates the need for triggers etc, and allows any number of tags per mytable (albeit with a minor tweak to the query).