I'm using the regex word boundary \b to match a word in the following sentence, but the result is not what I need: connector punctuation characters (such as the underscore) are not treated as word boundaries.
Sentence: ab﹎cd_de_gf|ij|kl|mn|op_
Regexp: \\bkl\\b
However, de is not getting matched.
I tried updating the regexp to use Unicode connector punctuation (it's a product requirement, as we support CJK languages as well), but that isn't working.
Regexp: (?<=\\b|[\p{Pc}])de(?=\\b|[\p{Pc}])
What am I missing here?
Note: (?<=\\b|_)de(?=\\b|_) seems to work for underscores, but I need the regex to work for all connector punctuation characters.
Thanks in advance!
CodePudding user response:
Based on the use case you have described, you can simplify your regex to:
(?<![[:alnum:]])de(?![[:alnum:]])
instead of trying to match word boundaries, Unicode punctuation characters, etc.
This will match de if it is not preceded or followed by any alphanumeric character.
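A minimal sketch of the same idea, assuming the regex runs in Java (the doubled backslashes in the question suggest this, but the language is not stated). Java's java.util.regex does not support the POSIX bracket form [[:alnum:]]; its spelling of that class is \p{Alnum}, which is ASCII-only by default. Since CJK text is a requirement, the sketch also shows a variant built on \p{L} and \p{N}. The class name, field names, and sample strings are illustrative only.

```java
import java.util.regex.Pattern;

class AlnumLookaroundDemo {
    // ASCII-only: \p{Alnum} is Java's spelling of the POSIX alnum class.
    static final Pattern ASCII =
        Pattern.compile("(?<!\\p{Alnum})de(?!\\p{Alnum})");

    // Unicode-aware variant: reject any neighbouring letter or number.
    static final Pattern UNICODE =
        Pattern.compile("(?<![\\p{L}\\p{N}])de(?![\\p{L}\\p{N}])");

    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        // The sentence from the question; U+FE4E is the wavy low line.
        String sentence = "ab\uFE4Ecd_de_gf|ij|kl|mn|op_";

        System.out.println(found(ASCII, sentence));     // true: de sits between underscores
        System.out.println(found(ASCII, "abcdef"));     // false: de is inside a longer word
        System.out.println(found(UNICODE, sentence));   // true

        // A CJK letter right before de: only the Unicode variant treats it as part of a word.
        System.out.println(found(ASCII, "\u6F22de"));   // true
        System.out.println(found(UNICODE, "\u6F22de")); // false
    }
}
```

Note that this approach treats every non-alphanumeric character as a separator, not only connector punctuation, so it is broader than what the question asks for.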
CodePudding user response:
To match any connector punctuation character you need \p{Pc}:
(?<=\\b|\\p{Pc})de(?=\\b|\\p{Pc})
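For illustration, here is a small sketch of that pattern in use, assuming Java (the thread does not name the language). In a Java string literal the backslash in \p must be doubled just like the one in \b, which is the difference from the pattern in the question. The helper wholeWord and the class name are hypothetical names introduced for this example.

```java
import java.util.regex.Pattern;

class ConnectorBoundaryDemo {
    // Build a pattern that accepts either a regular word boundary or a
    // connector punctuation character on each side of the word.
    static Pattern wholeWord(String word) {
        String edge = "\\b|\\p{Pc}";
        return Pattern.compile(
            "(?<=" + edge + ")" + Pattern.quote(word) + "(?=" + edge + ")");
    }

    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        // The sentence from the question; U+FE4E is the wavy low line.
        String sentence = "ab\uFE4Ecd_de_gf|ij|kl|mn|op_";

        // Plain word boundaries: the underscore counts as a word character.
        System.out.println(found(Pattern.compile("\\bde\\b"), sentence)); // false

        System.out.println(found(wholeWord("de"), sentence));  // true: underscores on both sides
        System.out.println(found(wholeWord("cd"), sentence));  // true: U+FE4E before, underscore after
        System.out.println(found(wholeWord("kl"), sentence));  // true: ordinary boundaries at the pipes
        System.out.println(found(wholeWord("de"), "abcdef"));  // false: de is inside a longer word
    }
}
```

The lookarounds only inspect the neighbouring character, so the connector punctuation itself is never part of the match.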
NOTE: \p{Pc} can also be written as [_\u203F\u2040\u2054\uFE33\uFE34\uFE4D-\uFE4F\uFF3F], which matches all 10 of these characters.
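If you want to check the explicit list against the property yourself, a quick sketch like the following could do it, again assuming Java. It tests each of the ten listed code points against both spellings; the class and method names are made up for this example.

```java
import java.util.regex.Pattern;

class ConnectorClassDemo {
    // The ten code points covered by the explicit character class.
    static final int[] LISTED = {
        0x005F, 0x203F, 0x2040, 0x2054, 0xFE33,
        0xFE34, 0xFE4D, 0xFE4E, 0xFE4F, 0xFF3F
    };

    static final Pattern BY_PROPERTY = Pattern.compile("\\p{Pc}");
    static final Pattern BY_LIST = Pattern.compile(
        "[_\\u203F\\u2040\\u2054\\uFE33\\uFE34\\uFE4D-\\uFE4F\\uFF3F]");

    // True when the single code point is matched by both spellings.
    static boolean bothMatch(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return BY_PROPERTY.matcher(s).matches() && BY_LIST.matcher(s).matches();
    }

    static boolean allListedMatch() {
        for (int cp : LISTED) {
            if (!bothMatch(cp)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Print each listed code point with its Unicode name and the check result.
        for (int cp : LISTED) {
            System.out.printf("U+%04X %s %b%n", cp, Character.getName(cp), bothMatch(cp));
        }
        System.out.println(bothMatch('-')); // false: a hyphen is not connector punctuation
    }
}
```

The explicit list is fixed, while \p{Pc} follows whatever Unicode version the runtime ships with, so the property form is the safer choice if new connector punctuation is ever added.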