PostgreSQL case-insensitive and accent-insensitive search-CodePudding

I have a data table and I would like to filter the columns. For example, to search for a user by his fullname.

However, I would like to allow the user to enter the search phrase case-independent and accents-independent.

So I have checked these (and more) sources and questions:

I thought the nondeterministic collations might finally be the right way how to achieve that, but unfortunately I don't know how:

combine case_insensitive and ignore_accents into one collation
how to allow searching only by substring in such a WHERE (e.g., find "Jóhn Doe" only by the string "joh") as nondeterministic collations do not support LIKE or regular expressions
which index to use

I would be very grateful for any advice on how to finally deal with this type of problem.

Thanks!

CodePudding user response：

Creating case and accent insensitive ICU collations is pretty simple:

CREATE COLLATION english_ci_ai (
   PROVIDER = icu,
   DETERMINISTIC = FALSE,
   LOCALE = "en-US-u-ks-level1"
);

Or, equivalently (that syntax also works wil old ICU versions:

CREATE COLLATION english_ci_ai (
   PROVIDER = icu,
   DETERMINISTIC = FALSE,
   LOCALE = "en-US@colStrength=primary"
);

See the ICU documentation for details and my article for a detailed discussion.

But your problem is that you want substring search. So you should create a trigram index:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS unaccent;

CREATE INDEX ON tab USING gin (unaccent(doc) gin_trgm_ops);

Then you can search like this:

SELECT * FROM tab
WHERE unaccent(doc) ILIKE unaccent('%joh%');

Note that you have to force a minimal length of 4 or so on the search string if you want that to be efficient.