How to split text with parentheses into words using a regular expression

Time:09-23

Here is a text containing English words, CJK characters, and fullwidth parentheses (\uff08 and \uff09):

这是（一段测试）文字（start开始end）的结果

I want to split the text into words, where each CJK character counts as one word. The special requirement is that the fullwidth left parenthesis \uff08 should be combined with the word after it, and the fullwidth right parenthesis \uff09 with the word before it.

The expected result is:

这
是
（一
段
测
试）
文
字
（start
开
始
end）
的
结
果

Currently, I use new Regex(@"(\s+)|([\u0000-\u001F\u0021-\u007F]+)|([^\u0000-\u007F])"); to split the text, but the fullwidth parentheses are not combined with the word before or after them.

CodePudding user response:

You can add these two special cases:

(\uff08(?:[^\u0000-\u007F]|[\u0021-\u007f]+))

and

((?:[^\u0000-\u007F]|[\u0021-\u007f]+)\uff09)

to your regex, giving you a complete regex of:

(\s+)|(\uff08(?:[^\u0000-\u007F]|[\u0021-\u007f]+))|((?:[^\u0000-\u007F]|[\u0021-\u007f]+)\uff09)|([\u0000-\u001F\u0021-\u007F]+)|([^\u0000-\u007F])
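To try the combined pattern in C#, here is a minimal sketch. It assumes each `+` quantifier is present in the pattern, and it collects tokens with `Regex.Matches` rather than `Regex.Split`, which may differ from how the question's code consumes the regex. The `Tokenizer` class and `Tokenize` helper are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class Tokenizer
{
    // Alternatives are tried left to right, so the two parenthesis cases
    // are listed before the plain ASCII-run and single-character cases.
    private static readonly Regex Pattern = new Regex(
        @"(\s+)" +
        @"|(\uff08(?:[^\u0000-\u007F]|[\u0021-\u007f]+))" +   // left parenthesis + following word
        @"|((?:[^\u0000-\u007F]|[\u0021-\u007f]+)\uff09)" +   // preceding word + right parenthesis
        @"|([\u0000-\u001F\u0021-\u007F]+)" +                 // run of ASCII characters
        @"|([^\u0000-\u007F])");                              // single non-ASCII character

    // Hypothetical helper: returns every match that is not pure whitespace.
    public static string[] Tokenize(string text) =>
        Pattern.Matches(text)
               .Cast<Match>()
               .Select(m => m.Value)
               .Where(v => !string.IsNullOrWhiteSpace(v))
               .ToArray();

    public static void Main()
    {
        // Make sure CJK characters survive console output.
        Console.OutputEncoding = Encoding.UTF8;

        // \uFF08 and \uFF09 are the fullwidth parentheses from the question.
        string text = "这是\uFF08一段测试\uFF09文字\uFF08start开始end\uFF09的结果";
        foreach (string token in Tokenize(text))
            Console.WriteLine(token);
    }
}
```

Under these assumptions, the sample sentence should come out as the fifteen tokens listed in the expected result, with each fullwidth parenthesis attached to its neighbouring word.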


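Note that these two alternatives must appear in the regex before the alternatives that could match the word on its own. Alternation is tried from left to right, so otherwise the plain word alternative matches first and the parenthesis is left as a separate token.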
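The effect of the ordering can be seen with a reduced pattern. The sketch below is a simplified illustration, not the full regex: it keeps only the ASCII-run, right-parenthesis, and single-character alternatives, and the `OrderingDemo` class and `Tokens` helper are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class OrderingDemo
{
    // Hypothetical helper: joins all matches with '|' so the two results are easy to compare.
    public static string Tokens(string pattern, string text) =>
        string.Join("|", Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value));

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        // An ASCII word followed by a fullwidth right parenthesis (\uFF09).
        string text = "end\uFF09";

        // Plain ASCII-run alternative first: it consumes "end" on its own,
        // so the parenthesis is matched separately by the last alternative.
        string wordFirst = @"([\u0021-\u007F]+)|([\u0021-\u007F]+\uff09)|([^\u0000-\u007F])";

        // Parenthesis alternative first: the word and the parenthesis match together.
        string parenFirst = @"([\u0021-\u007F]+\uff09)|([\u0021-\u007F]+)|([^\u0000-\u007F])";

        Console.WriteLine(Tokens(wordFirst, text));   // two tokens: "end" and the parenthesis
        Console.WriteLine(Tokens(parenFirst, text));  // one token: "end" with the parenthesis attached
    }
}
```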
