How to find all words containing ngram x and ngram y in PostgreSQL (using manual implementation of n-CodePudding

Say we have two tables in PostgreSQL:

words with text field containing the word (English word for now), and id.
ngrams containing text field and mapping to word_id.

How would you "select all words which contain x and y"?

select * from words
inner join ngrams on ngrams.word_id = words.id
where ngrams.text = x
and ngrams.text = y

The general problem is, how do I do an AND but use two different records from the ngrams table? What I did just there is compare the same ngram text value to two different variables, asking for AND. That doesn't make sense, it should somehow use two different ngram records. How generally do you do that in PostgreSQL?

Note, I don't want to use PostgreSQL full-text search functionality. I am trying to understand how to implement it at a lower level (manually), and also want to apply to hundreds of languages beyond just English (like to Chinese, which has at least 80k characters).

https://medium.com/@evanwhalen/anatomy-of-an-n-gram-search-b30c6f20ad39

CodePudding user response：

You can use two exists subqueries:

select w.* from words where exists (select 1 from ngrams n 
    where n.word_id = w.id and n.text = x) -- check if the word has associated ngram of x
  and exists (select 1 from ngrams n 
    where n.word_id = w.id and n.text = y) -- also ensure that the word has associated ngram of y

CodePudding user response：

This is a well-phrased question.

What you're looking for is a Self Join. So we query ngrams once, calling it e.g. a, and query it a 2nd time, calling it b. Then the WHERE clause looks for a.text = x AND b.text = y. (I prolly would have named the column ngram instead, whatever.)

Let's look for just the ngrams, no word.

select *
from ngrams a
join ngrams b
where a.text = x
      and b.text = y
      and a.text <= b.text;

Additionally, we only want the lower triangle of this Cartesian cross product, so constrain a.text <= b.text. No need to retrieve both ("th", "he") and ("he", "th") for "the". Set up your query beforehand so that x <= y obtains.

With the ngrams in hand, it's straightforward to additionally JOIN against words and maybe use GROUP BY to boil them down to distinct words.

Tacking on an ORDER BY, at least during debugging, will make the result set a bit easier to read.