I'm trying to extract all Chinese sentences from strings that also contain additional placeholder groups such as [NAME] and [PLACE].
I have this string
<DisplayName>凡人战争</DisplayName>
<Desc>[NAME]赶到[PLACE],发现战火正燃,此地百姓饱受战争之苦。</Desc>
<Display>劝停战争</Display>
<OKResult><![CDATA[me:AddMsg(XT("[NAME]以仙法摄走两军首领,一番劝戒,迫使他们停止了战争 ...
and I want to find
凡人战争
[NAME]赶到[PLACE],发现战火正燃,此地百姓饱受战争之苦
[NAME]以仙法摄走两军首领,一番劝戒,迫使他们停止了战争,消弭了这场祸事
此举手段温和,虽无人知晓,但却顺应天道,[NAME]获得了一些功德
I know the regex for Chinese characters is [\u4e00-\u9fff\uFF0C]
and for the placeholder groups it is (\u005BNAME\u005D)
and (\u005BPLACE\u005D),
but how do I combine them?
I tried this in Python:
Array_of_words = re.findall(r'[\u4e00-\u9fff\uFF0C(\u005BNAME\u005D)(\u005BPLACE\u005D)]+', text)
But it also matches single letters and brackets, like this:
['N', 'N', '凡人战争', 'N', '[NAME]赶到[PLACE],发现战火正燃,此地百姓饱受战争之苦', '劝停战争', '[C', 'A', 'A[', 'A', 'M', '(', '(', '[NAME]以仙法摄走两军首领,一番劝戒,迫使他们停止了战争,消弭了这场祸事', '此举手段温和,虽无人知晓,但却顺应天道,[NAME]获得了一些功德', '))', 'A', 'P', '(', '(', '))', '()', ']]']
CodePudding user response:
You can use
re.findall(r'(?:\[(?:PLACE|NAME)]|[\u4e00-\u9fff\uFF0C])+', text)
Details:
(?: - start of a non-capturing group:
    \[(?:PLACE|NAME)] - a [, then either PLACE or NAME, and then a ]
    | - or
    [\u4e00-\u9fff\uFF0C] - a Chinese char pattern of yours
)+ - end of the group, match one or more occurrences.
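As a rough illustration, here is a minimal sketch that runs both the question's pattern and the alternation-based one side by side. The sample text is assumed: it is built only from the first three tags shown in the question, with the commas written as \uFF0C (the fullwidth comma the character class allows). The variable names are illustrative.

```python
import re

# Sample built from the question's first three tags; \uFF0C is the
# fullwidth comma that the character class allows.
text = (
    "<DisplayName>凡人战争</DisplayName>\n"
    "<Desc>[NAME]赶到[PLACE]\uFF0C发现战火正燃\uFF0C此地百姓饱受战争之苦。</Desc>\n"
    "<Display>劝停战争</Display>\n"
)

# The question's attempt: everything sits inside one [...] class, so the
# parentheses, brackets and the letters of NAME/PLACE each become a single
# allowed character instead of a required sequence.
broken = re.findall(
    r'[\u4e00-\u9fff\uFF0C(\u005BNAME\u005D)(\u005BPLACE\u005D)]+', text
)

# Alternation: a whole placeholder OR one Chinese character, repeated.
pattern = re.compile(r'(?:\[(?:PLACE|NAME)]|[\u4e00-\u9fff\uFF0C])+')
matches = pattern.findall(text)

print(broken)   # includes stray 'N' matches from the tag names
print(matches)  # only the Chinese runs, with placeholders kept intact
```

With this sample, broken picks up the N of DisplayName as a separate match, while matches contains just three strings: 凡人战争, the [NAME]... sentence up to (but not including) the closing 。, and 劝停战争.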
To match any uppercase ASCII letters inside square brackets, replace \[(?:PLACE|NAME)] with \[[A-Z]+].
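To show the difference between the two variants, here is a small sketch. The [ITEM] placeholder and the sample sentence are made up for illustration only; they do not come from the question.

```python
import re

# [ITEM] is a hypothetical placeholder, used only to show the difference.
text = "<Desc>[NAME]在[PLACE]获得了[ITEM]。</Desc>"

# Only the two known placeholders are accepted.
specific = re.compile(r'(?:\[(?:PLACE|NAME)]|[\u4e00-\u9fff\uFF0C])+')
# Any run of uppercase ASCII letters in square brackets is accepted.
generic = re.compile(r'(?:\[[A-Z]+]|[\u4e00-\u9fff\uFF0C])+')

specific_matches = specific.findall(text)
generic_matches = generic.findall(text)

print(specific_matches)  # the match stops right before [ITEM]
print(generic_matches)   # [ITEM] is kept as part of the sentence
```

The specific pattern is safer when the set of placeholders is fixed and known. The generic one avoids editing the regex each time a new placeholder appears, at the cost of also accepting any other bracketed uppercase word.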