How to efficiently store 1 million words and query them by starts_with, contains, or ends_with?


How do sites like word finders store tens of thousands of words "containing c", or "words with d and c", or, going further, "unscramble" a string like CAUDK and find that the database has duck? Curious from an algorithms/efficiency perspective how they would accomplish this:

Would a database be used, or would the words simply be stored in memory and quickly traversed? If a database were used (and each word were a record), how would you make these sorts of queries (with PostgreSQL, for example: contains, starts_with, ends_with, and unscramble)?

I guess the easiest thing would be to store all the words in memory (sorted?) and just traverse the whole million-or-fewer-word list to find the matches? But what about the unscramble one?

Basically, I'm wondering what the efficient way to do this would be.
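For the three pattern queries specifically, a plain SQL baseline already gets you most of the way: they map directly onto LIKE patterns (a prefix pattern like 'c%' can use an ordinary index, while '%c' and '%c%' generally force a scan unless you add something like PostgreSQL's pg_trgm trigram index). A minimal sketch using Python's built-in sqlite3, with an illustrative words table of my own; the SQL is essentially the same in PostgreSQL:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE words (w TEXT)')
conn.executemany('INSERT INTO words VALUES (?)',
                 [('duck',), ('cat',), ('arc',), ('cab',)])

# starts_with, contains, ends_with expressed as LIKE patterns
print(conn.execute("SELECT w FROM words WHERE w LIKE 'c%'").fetchall())   # cat, cab
print(conn.execute("SELECT w FROM words WHERE w LIKE '%c%'").fetchall())  # all four
print(conn.execute("SELECT w FROM words WHERE w LIKE '%c'").fetchall())   # arc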

CodePudding user response:

I don't know for sure what they're doing, but I suggest this algorithm for contains and unscramble (and, I think, it can be trivially extended to starts_with and ends_with):

  1. The user submits a set of letters in the form of a string. Say the user submits bdsfa.
  2. The algorithm sorts the string from step (1), so the query becomes abdfs.
  3. Then, to find all words with those letters in them, the algorithm simply accesses the directory database/a/b/d/f/s/ and finds all the words there. If it finds that directory to be empty, it goes one level up, to database/a/b/d/f/, and shows the results there (see the sketch below).
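In code, steps (1)–(3) amount to sorting the submitted letters and turning them into a path; a tiny sketch (the helper name query_path is mine):

from pathlib import Path

def query_path(letters):
    # Steps (1)-(2): sort the letters, e.g. 'bdsfa' -> 'abdfs'.
    # Step (3): each letter becomes one directory level under database/.
    return Path('database', *sorted(letters))

print(query_path('bdsfa'))  # database/a/b/d/f/s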

So, now, the question is: how do you index a database of millions of words as done in step (3)? The database/ directory will have 26 directories inside it, for a to z, each of which will have 26 − 1 = 25 directories, one for every letter except its parent's. E.g.:

database/a/{b,c,...,z}
database/b/{a,c,...,z}
...
database/z/{a,c,...,y}

This tree structure will be only 26 levels deep, and each node will have no more than 26 children, so browsing this directory structure is scalable.

Words will be stored in the leaves of this tree: the path to a word is its sorted, de-duplicated letters. So the word apple will be stored as database/a/e/l/p/leaf_apple. In the same place you will also find other words built from those letters, such as leap. More specifically:

database/
  a/
    e/
      l/
        p/
          leaf_apple
          leaf_leap
          leaf_peal
          ...

This way, you can efficiently reach the subset of target words in O(log n) time, where n is the total number of words in your database.
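Here is a minimal sketch of this scheme against a real (temporary) directory tree, assuming the sorted, de-duplicated-letters convention above; the helper names key_path, index_word, and lookup are mine:

import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp()) / 'database'

def key_path(word):
    # 'apple' -> database/a/e/l/p: sorted, de-duplicated letters.
    return root.joinpath(*sorted(set(word)))

def index_word(word):
    d = key_path(word)
    d.mkdir(parents=True, exist_ok=True)       # create the branch on demand
    (d / f'leaf_{word}').touch()               # the word itself is a leaf file

def lookup(letters):
    d = key_path(letters)
    while d != root.parent and not d.exists():
        d = d.parent                           # fallback: go one level up
    if not d.exists():
        return []
    return sorted(p.name[len('leaf_'):]
                  for p in d.iterdir() if p.name.startswith('leaf_'))

for w in ('apple', 'leap', 'peal', 'duck'):
    index_word(w)

print(lookup('apple'))   # ['apple', 'leap', 'peal']
print(lookup('duck'))    # ['duck']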

You can further optimise this by adding additional indices. For example, there are very many words containing a, and the website won't display them all (at least not on the first page). Instead, the website may say there are 500,000 words in total containing 'a' and show 100 examples. To obtain that 500,000 count efficiently, the number of children at every level can be recorded during indexing, e.g.:

database/
  {...}/
    {num_children, ...}/
      {num_children, ...}/
        ...

Here, num_children is just a leaf node, like leaf_WORD. All leaves are files.
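Extending the directory sketch above (reusing its root), a hypothetical post-indexing pass that writes each subtree's word count into a num_children file:

def write_counts(d):
    # Count this directory's own leaf_ words, recurse into children,
    # then record the subtree total in a num_children leaf file.
    total = sum(1 for p in d.iterdir()
                if p.is_file() and p.name.startswith('leaf_'))
    total += sum(write_counts(sub) for sub in d.iterdir() if sub.is_dir())
    (d / 'num_children').write_text(str(total))
    return total

write_counts(root)
# Every word containing 'a' sorts to a path under database/a/:
print((root / 'a' / 'num_children').read_text())  # 3 (apple, leap, peal)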

Depending on the load this website has, it may not need to load this database into memory at all. It can simply leave it to the operating system to decide which portions of the file system to cache in memory as a read-time optimisation.

Personally, as a criticism of many applications, I think developers tend to jump to requiring RAM too quickly, even when a simple file-system trick can do the job with no noticeable difference to the end user.

CodePudding user response:

"Containing C" amounts to count(C)==1. Unscrambling CAUDC amounts to count(C) <= 2 && count(A) <= 1 && count(U) <= 1 && count(D) <= 1. So both queries could be efficiently answered by a database with 26 indices, one for the count of each letter in the alphabet.

Here is a quick-and-dirty Python sqlite3 demo:

from collections import Counter
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

alphabet = [chr(ord('a') + i) for i in range(26)]
alphabet_set = set(alphabet)

# One row per word, plus one letter-count column per letter of the alphabet.
columns = ['word text'] + [f'{c}_count TINYINT DEFAULT 0' for c in alphabet]
create_cmd = f'CREATE TABLE abc ({", ".join(columns)})'
cur.execute(create_cmd)

# 26 indices, one per letter-count column.
for c in alphabet:
    cur.execute(f'CREATE INDEX {c}_index ON abc ({c}_count)')

def insert(word):
    counts = Counter(word)
    columns = ['word'] + [f'{c}_count' for c in counts.keys()]
    values = [f'"{word}"'] + [f'{n}' for n in counts.values()]
    insert_cmd = f'INSERT INTO abc ({", ".join(columns)}) VALUES ({", ".join(values)})'
    cur.execute(insert_cmd)

def unscramble(text):
    # A word is buildable from `text` iff each of its letter counts fits
    # within the query's count for that letter (0 for unused letters).
    counts = {a: 0 for a in alphabet}
    for c in text:
        counts[c] += 1

    where_clauses = [f'{c}_count <= {n}' for (c, n) in counts.items()]
    select_cmd = f'SELECT word FROM abc WHERE {" AND ".join(where_clauses)}'
    cur.execute(select_cmd)
    return list(sorted(tup[0] for tup in cur.fetchall()))

print('Building sqlite table...')
for line in open('/usr/share/dict/words'):
    word = line.strip().lower()
    if all(c in alphabet_set for c in word):
        insert(word)
print('Table built!')

print("unscramble('caudk'): ", unscramble('caudk'))

Output:

Building sqlite table...
Table built!
unscramble('caudk'):  ['a', 'a', 'ac', 'ac', 'ad', 'ak', 'au', 'auk', 'c', 'c', 'ca', 'ca', 'ca', 'cad', 'cd', 'cd', 'cu', 'cu', 'cud', 'd', 'd', 'da', 'dc', 'duck', 'k', 'k', 'kc', 'u', 'u', 'uk']