More efficient way to query for column permutations in SQLite-CodePudding

For dummy's sake, let's say that I have a database with columns text, ind and sentid. You can think of it as a database with one row per word, with a word's text, its position in the sentence, and the ID of the sentence.

To query for specific occurrences of n words, I join the table with itself n times. That table does not have a unique column except for the default rowid. I oftentimes want to query these columns in such a way that the integers in the n ind columns are sequential without any integer between them. Sometimes the order matters, sometimes the order does not. At the same time, each of the n columns also needs to fulfil some requirement, e.g. n0.text = 'a' AND n1.text = 'b' AND n2.text = 'c'. Put differently, in every sentence (unique sentid), find all occurrences of a b c either ordered or in any order (but sequential).

I have solved the ordered case quite easily. For three columns with names ind0, ind1, ind2 you could simply have a query like the following and scale it accordingly as n columns grows.

WHERE ind1 = ind0   1 AND ind2 = ind1   1

Perhaps there is a better way (do tell me if that is so) but I have found that this works reliably well in terms of performance (query speed).

The tougher nut to crack is those cases where the integers also have to be sequential but where the order does not matter (e.g., 7 9 8, 3 1 2, 11 10 9). My current approach is "brute forcing" by simply generating all possible permutations of orders (e.g., (ind1 = ind0 1 AND ind2 = ind1 1) OR (ind0 = ind1 1 AND ind2 = ind0 1) OR ...)). But as n grows, this becomes a huge list of possibilities and my query speed seems to be really hurting on this. For example, for n=6 (the max requirement) this will generate 720 potential orders separate with OR. Such an approach, that works, is given as a minimal but documented example below for you to try out.

I am looking for a more generic, SQL-y solution that hopefully positively impacts performance when querying for sequential (but not necessarily ordered) columns.

Fiddle with data and current implementation here, reproducible Python code below.

Note that it is possible to get multiple results per sentid, but only if at least one of the word indices differs in the three matched words. E.g. permutations of the results themselves are not needed. E.g., 1-2-0 and 3-2-1 can both be valid results for one sentid, but both 1-2-0 and 2-1-0 can't.

import sqlite3
from itertools import permutations
from pathlib import Path
from random import shuffle


def generate_all_possible_sequences(n: int) -> str:
    """Given an integer, generates all possible permutations of the 'n' given indices with respect to
    order in SQLite. What this means is that it will generate all possible permutations, e.g., for '3':
    0, 1, 2; 0, 2, 1; 1, 0, 2; 1, 2, 0 etc. and then build corresponding SQLite requirements, e.g.,
    0, 1, 2: ind1 = ind0   1 AND ind2 = ind1   1
    0, 2, 1: ind2 = ind0   1 AND ind1 = ind2   1
    ...

    and all these possibilities are then concatenated with OR to allow every possibility:
    ((ind1 = ind0   1 AND ind2 = ind1   1) OR (ind2 = ind0   1 AND ind1 = ind2   1) OR ...)
    """
    idxs = list(range(n))

    order_perms = []
    for perm in permutations(idxs):
        this_perm_orders = []
        for i in range(1, len(perm)):
            this_perm_orders.append(f"w{perm[i]}.ind = w{perm[i-1]}.ind   1")
        order_perms.append(f"({' AND '.join(this_perm_orders)})")

    return f"({' OR '.join(order_perms)})"


def main():
    pdb = Path("temp.db")

    if pdb.exists():
        pdb.unlink()

    conn = sqlite3.connect(str(pdb))
    db_cur = conn.cursor()
    # Create a table of words, where each word has its text, its position in the sentence, and the ID of its sentence
    db_cur.execute("CREATE TABLE tbl(text TEXT, ind INTEGER, sentid INTEGER)")

    # Create dummy data
    vals = []
    for sent_id in range(20):
        shuffled = ["a", "b", "c", "d", "e", "a", "c"]
        shuffle(shuffled)
        for word_id, word in enumerate(shuffled):
            vals.append((word, word_id, sent_id))

    # Wrap the values in single quotes for SQLite
    vals = [(f"'{v}'" for v in val) for val in vals]
    # Convert values into INSERT commands
    cmds = [f"INSERT INTO tbl VALUES ({','.join(val)})" for val in vals]
    # Build DB
    db_cur.executescript(f"BEGIN TRANSACTION;{';'.join(cmds)};COMMIT;")

    print(f"BEGIN TRANSACTION;{';'.join(cmds)};COMMIT;\n")

    # Query DB for sequential occurences in ind0, ind1, and ind2: the order does not matter
    # but they have to be sequential
    query = f"""SELECT w0.ind, w1.ind, w2.ind, w0.sentid
FROM tbl AS w0
JOIN tbl AS w1 USING (sentid)
JOIN tbl AS w2 USING (sentid)
WHERE w0.text = 'a'
    AND w1.text = 'b'
    AND w2.text = 'c'
    AND {generate_all_possible_sequences(3)}"""

    print(query)
    print()

    print("a_idx\tb_idx\tc_idx\tsentid")
    for res in db_cur.execute(query).fetchall():
        print("\t".join(map(str, res)))

    db_cur.close()
    conn.commit()
    conn.close()
    pdb.unlink()


if __name__ == '__main__':
    main()

CodePudding user response：

This is a solution that uses mostly the rowids to create all the possible permutations:

WITH cte AS (
  SELECT t0.rowid rowid0, t1.rowid rowid1, t2.rowid rowid2
  FROM tbl AS t0
  JOIN tbl AS t1 ON t1.sentid = t0.sentid AND t1.ind = t0.ind   1
  JOIN tbl AS t2 ON t2.sentid = t1.sentid AND t2.ind = t1.ind   1
)
SELECT t0.ind ind0, t1.ind ind1, t2.ind ind2 
FROM cte c
JOIN tbl t0 ON t0.rowid IN (c.rowid0, c.rowid1, c.rowid2)
JOIN tbl t1 ON t1.rowid IN (c.rowid0, c.rowid1, c.rowid2) AND t1.rowid <> t0.rowid
JOIN tbl t2 ON t2.rowid IN (c.rowid0, c.rowid1, c.rowid2) AND t2.rowid NOT IN (t0.rowid, t1.rowid)

See a simplified demo.

The query plan (in the above demo) shows that SQLite uses covering indexes and the rowids to perform the joins.
I don't know how this would scale for more columns as this requirement is a performance killer by design because of the multiple joins and the number of rows that must be retuned for each tuple of n columns that satisfy the conditions (=n!).

CodePudding user response：

I found that the easiest way to implement this is to use the following condition:

AND MAX(w0.ind, w1.ind, w2.ind) - MIN(w0.ind, w1.ind, w2.ind) = 2

where 2 = the number of words that we are looking for -1. That being said, it is hard to say much about the performance since "no, indexes can't be used in expressions like MAX(...) - MIN(...) where functions are used." as @forpas mentions.