Say we have two tables in PostgreSQL:
words
withtext
field containing the word (English word for now), andid
.ngrams
containingtext
field and mapping toword_id
.
How would you "select all words which contain x and y"?
select * from words
inner join ngrams on ngrams.word_id = words.id
where ngrams.text = x
and ngrams.text = y
The general problem is, how do I do an AND
but use two different records from the ngrams
table? What I did just there is compare the same ngram text value to two different variables, asking for AND. That doesn't make sense, it should somehow use two different ngram records. How generally do you do that in PostgreSQL?
Note, I don't want to use PostgreSQL full-text search functionality. I am trying to understand how to implement it at a lower level (manually), and also want to apply to hundreds of languages beyond just English (like to Chinese, which has at least 80k characters).
https://medium.com/@evanwhalen/anatomy-of-an-n-gram-search-b30c6f20ad39
CodePudding user response:
You can use two exists
subqueries:
select w.* from words where exists (select 1 from ngrams n
where n.word_id = w.id and n.text = x) -- check if the word has associated ngram of x
and exists (select 1 from ngrams n
where n.word_id = w.id and n.text = y) -- also ensure that the word has associated ngram of y
CodePudding user response:
This is a well-phrased question.
What you're looking for is a Self Join.
So we query ngrams
once, calling it e.g. a
,
and query it a 2nd time, calling it b
.
Then the WHERE clause looks for a.text = x AND b.text = y
.
(I prolly would have named the column ngram
instead, whatever.)
Let's look for just the ngrams, no word.
select *
from ngrams a
join ngrams b
where a.text = x
and b.text = y
and a.text <= b.text;
Additionally, we only want the lower triangle
of this Cartesian cross product,
so constrain a.text <= b.text
.
No need to retrieve both ("th", "he") and ("he", "th")
for "the".
Set up your query beforehand so that x <= y
obtains.
With the ngrams in hand, it's straightforward
to additionally JOIN against words
and maybe use GROUP BY to boil them down to distinct words.
Tacking on an ORDER BY, at least during debugging, will make the result set a bit easier to read.