Here is a text with English words, CJK characters, and fullwidth parentheses (\uff08 and \uff09):
这是（一段测试）文字（start开始end）的结果
I want to split the text into words; for CJK characters, each character is one word. The special requirement is that the fullwidth left parenthesis \uff08 should combine with the word after it, and the fullwidth right parenthesis \uff09 should combine with the word before it.
The expected result is:
这
是
（一
段
测
试）
文
字
（start
开
始
end）
的
结
果
Currently, I use
new Regex(@"(\s+)|([\u0000-\u001F\u0021-\u007F]+)|([^\u0000-\u007F])");
to split the text, but the fullwidth parentheses are not combined with the word before or after them.
CodePudding user response:
You can add these two special cases:
(\uff08(?:[^\u0000-\u007F]|[\u0021-\u007f]+))
and
((?:[^\u0000-\u007F]|[\u0021-\u007f]+)\uff09)
to your regex, giving a complete regex of:
(\s+)|(\uff08(?:[^\u0000-\u007F]|[\u0021-\u007f]+))|((?:[^\u0000-\u007F]|[\u0021-\u007f]+)\uff09)|([\u0000-\u001F\u0021-\u007F]+)|([^\u0000-\u007F])
Note that they must appear before the alternatives that can match the word on its own. Alternation is tried left to right, so otherwise the plain-word match would take precedence and the parenthesis would be left as a separate token.
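As a rough illustration, here is a minimal sketch of how the combined pattern might be used. It assumes the tokens are collected with Regex.Matches rather than Regex.Split (the question does not show which call is used), and that whitespace-only matches should be dropped. The Tokenizer class and Tokenize method are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class Tokenizer
{
    // Combined pattern from the answer. The two parenthesis cases are listed
    // before the generic word cases so that they are tried first.
    private static readonly Regex TokenRegex = new Regex(
        @"(\s+)" +
        @"|(\uff08(?:[^\u0000-\u007F]|[\u0021-\u007F]+))" +   // （ + following word
        @"|((?:[^\u0000-\u007F]|[\u0021-\u007F]+)\uff09)" +   // preceding word + ）
        @"|([\u0000-\u001F\u0021-\u007F]+)" +                 // run of ASCII non-space chars
        @"|([^\u0000-\u007F])");                              // single non-ASCII char

    // Hypothetical helper: returns every match except whitespace-only ones.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in TokenRegex.Matches(text))
        {
            if (!string.IsNullOrWhiteSpace(m.Value))
                tokens.Add(m.Value);
        }
        return tokens;
    }

    public static void Main()
    {
        // \uff08 and \uff09 are the fullwidth left and right parentheses.
        string text = "这是\uff08一段测试\uff09文字\uff08start开始end\uff09的结果";
        Console.WriteLine(string.Join("\n", Tokenize(text)));
    }
}
```

With the sample sentence from the question, this should print the fifteen tokens listed in the expected result, one per line. If you prefer Regex.Split, keep in mind that it also returns the captured groups and empty strings between adjacent matches, so the output would need extra filtering.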