Home > Software design >  MySQL 8.0.30 Regular Expression Word Matching with Special Characters
MySQL 8.0.30 Regular Expression Word Matching with Special Characters

Time:08-19

While there's a told of "old" examples on the internet using the now unsupported '[[:<:]]word[[:>:]]' technique, I'm trying to find out how, in MySQL 8.0.30, to do exact word matching from our table with words that have special characters in them.

For example, we have a paragraph of text like:

"Senior software engineer and C   developer with Unit Test and JavaScript experience. I also have .NET experience!"

We have a table of keywords to match against this and have been using the basic system of:

                    SELECT
                        sk.ID
                    FROM
                        sit_keyword sk
                    WHERE
                        var_text REGEXP CONCAT('\\b',sk.keyword,'\\b')

It works fine 90% of the time, but it completely fails on:

C#, C , .NET, A or "A " etc. So it's failing to match keywords with special characters in them.

I can't seem to find any recent documentation on how to address this since, as mentioned, nearly all of the examples I can find use the old unsupported techniques. Note I need to match these words (with special characters) anywhere in the source text, so it can be the first or last word, or somewhere in the middle.

Any advice on the best way to do this using REGEXP would be appreciated.

CodePudding user response:

You need to escape special chars in the search phrase and use the construct that I call "adaptive dynamic word boundaries" instead of word boundaries:

var_text REGEXP CONCAT('(?!\\B\\w)',REGEXP_REPLACE(sk.keyword, '([-.^$* ?()\\[\\]{}\\\\|])', '\\$1'),'(?<!\\w\\B)')

The REGEXP_REPLACE(sk.keyword, '([-.^$* ?()\\[\\]{}\\\\|])', '\\$1') matches . ^ $ * - ? ( ) [ ] { } \ | chars (adds a \ before them) and (?!\\B\\w) / (?<!\\w\\B) require word boundaries only when the search phrase start/ends with a word char.

More details on adaptive dynamic word boundaries and demo in my YT video.

CodePudding user response:

Regular expressions treat several characters as metacharacters. These are documented in the manual on regular expression syntax: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-syntax

If you need a metacharacter to be treated as the literal character, you need to escape it with a backslash.

This gets very complex. If you just want to search for substrings, perhaps you should just use LOCATE():

WHERE LOCATE(sk.keyword, var_text) > 0

This avoids all the trickery with metacharacters. It treats the string of sk.keyword as containing only literal characters.

  • Related