I have a data table and I would like to filter the columns. For example, to search for a user by his fullname.
However, I would like to allow the user to enter the search phrase case-independent and accents-independent.
So I have checked these (and more) sources and questions:
- https://stackoverflow.com/a/11007216
- How to ignore case sensitive rows in PostgreSQL
- https://www.postgresql.org/docs/current/collation.html#COLLATION-NONDETERMINISTIC
I thought the nondeterministic collations might finally be the right way how to achieve that, but unfortunately I don't know how:
- combine case_insensitive and ignore_accents into one collation
- how to allow searching only by substring in such a WHERE (e.g., find "Jóhn Doe" only by the string "joh") as nondeterministic collations do not support LIKE or regular expressions
- which index to use
I would be very grateful for any advice on how to finally deal with this type of problem.
Thanks!
CodePudding user response:
Creating case and accent insensitive ICU collations is pretty simple:
CREATE COLLATION english_ci_ai (
PROVIDER = icu,
DETERMINISTIC = FALSE,
LOCALE = "en-US-u-ks-level1"
);
Or, equivalently (that syntax also works wil old ICU versions:
CREATE COLLATION english_ci_ai (
PROVIDER = icu,
DETERMINISTIC = FALSE,
LOCALE = "en-US@colStrength=primary"
);
See the ICU documentation for details and my article for a detailed discussion.
But your problem is that you want substring search. So you should create a trigram index:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE INDEX ON tab USING gin (unaccent(doc) gin_trgm_ops);
Then you can search like this:
SELECT * FROM tab
WHERE unaccent(doc) ILIKE unaccent('%joh%');
Note that you have to force a minimal length of 4 or so on the search string if you want that to be efficient.