Using unaccent with two different rules-CodePudding

German language uses diacritical characters ä, ö, ü. For international use, they are translated into ae, oe, ue (not a, o, u). This mean, that Müller is Mueller on his ID document. This is what we get, when we read the document with (for example) passport reader and this is what we save to the database table.

In next step we search for the records. We do it in two ways:

by entering search data with passport reader (no problem in here)
by entering search data manually

With manual insert there is little problem, because user may enter data international way: 'Mueller' or popular way 'Müller'.

This problem can be solved by using postgres extension Unaccent and modification of unaccent.rules file, so despite is user inserts 'Mueller' or 'Müller', we search in the database for Mueller.

So far so good...

BUT

in the same table we have also other origin's names - for example Turkish ones. Turks translates theirs umlauts (ä, ö, ü) directly into a, o, u, and this way they are saved on the documents, so Müller would by Muller on Turkish document. This causes a problem because (as described before) we search with German unnaccent.rules and we don't find people who we search for.

Long story, but finally question...

... does anybody have any idea how to handle it?
Is there any way, to have two unaccent.rules and use them with or???... for example

Select * from table where last_name = unaccent('Müller' (use German rules)) or last_name = unaccent('Müller' (use Turkish rules))

(I know that what's above does not work, but maybe there is something similar we could use)

regards M

CodePudding user response：

The solution should be simple. Define your German unaccent dictionary (I'll call it entumlauten), then query like

SELECT ...,
       last_name = unaccent('unaccent', 'Müller') AS might_be_turkish,
       last_name = unaccent('entumlauten', 'Müller') AS might_be_german,
FROM tab      
WHERE last_name IN (unaccent('unaccent', 'Müller'),
                    unaccent('entumlauten', 'Müller'))

IN (or (= ANY) will perform better than OR, because it can use an index scan. The additional columns in the SELECT list tell you which condition was matched.

CodePudding user response：

Use soundex() function. This is suitable only for creating lists for human user to pick wanted name. You probably should clean all diacritics (use the Turkish way) before using this.

It also handles similar sounding letters, like C, S and Z or D and T. So Schmidt would match Smith or Jönssen matches Johnson.