I am testing a simple search mechanism to handle SMALL typos/misspellings. Similar to a autocorrect mechanism.
I am struggling a lot with this. So I am creating a function (pl/pgsql) to handle this, and I am running it on SUPABASE.IO, PostgreSQL 13.3 (similar to RDS).
I would like to:
- LIMIT the returned results to only the highly similar email addresses, say similarity > 0.7;
- Use an INDEX as the actual list of emails will be in the order of tens of millions, so it has to return in under a second.
DROP TABLE IF EXISTS email;
CREATE TABLE email (
email_address TEXT NOT NULL UNIQUE,
person_id UUID NOT NULL,
CONSTRAINT email_pk PRIMARY KEY (email_address)
);
DROP INDEX IF EXISTS email_address_trigram_idx;
CREATE INDEX email_address_trigram_idx ON email USING gin(email_address gin_trgm_ops);
INSERT INTO email(email_address, person_id) VALUES
('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4())
, ('[email protected]', uuid_generate_v4());
SET pg_trgm.similarity_threshold = 0.8; -- This doesn't seem to affect my queries
SELECT *, similarity('[email protected]', email_address)
FROM email
WHERE email_address % '[email protected]';
I want a way of searching quickly and still tolerate some minor typos in the search.
CodePudding user response:
First off, your table definition creates two unique indices on (email_address)
. Don't. Drop the UNIQUE
constraint, keep the PK:
CREATE TABLE email (
email_address text PRIMARY KEY
, person_id uuid NOT NULL -- bigint?
);
(Also not sure why you would need uuid
for person_id
. There aren't nearly enough people in the world to justify more than a bigint
.)
Next, since you want to ...
LIMIT the returned results to only the highly similar email addresses,
I suggest a nearest neighbor search. Create a GiST index for this purpose instead of the GIN:
CREATE INDEX email_address_trigram_gist_idx ON email USING gist (email_address gist_trgm_ops);
And use a query like this:
SELECT *, similarity('[email protected]', email_address)
FROM email
WHERE email_address % '[email protected]'
ORDER BY email_address <-> '[email protected]' -- note the use of the operator <->
LIMIT 10;
Quoting the manual:
This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.
While working with a small LIMIT
, there is probably no need to set pg_trgm.similarity_threshold
very high because this query gives you the best matches first.
Related: