I have some sentences like the following:
w = "Chiang Wei-kuo (traditional Chinese: 蔣緯國; simplified Chinese: 蒋纬国; pinyin: Jiǎng Wěiguó, or Wego Chiang; and Nadezhda Nikitichna Mikhalkova (Russian: Надежда Никитична Михалкова;"
and I am using the following regular expression to capture the non-Latin words (蔣緯國, 蒋纬国, Надежда Никитична Михалкова):
for match in re.finditer(r'(?<=:\s)\W+(?=;)', w):
    print(match[0])
So I am trying to capture any non-word characters (\W) between the symbol : and the symbol ;, but it doesn't seem to be working. I also tried substituting \W with [^a-zA-Z0-9_], but it still doesn't work. Any help on this?
CodePudding user response:
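You could also use the re.ASCII flag. With it, \w matches only [a-zA-Z0-9_], so \W matches everything else, including Chinese and Cyrillic letters: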
import re
w = "Chiang Wei-kuo (traditional Chinese: 蔣緯國; simplified Chinese: 蒋纬国; pinyin: Jiǎng Wěiguó, or Wego Chiang; and Nadezhda Nikitichna Mikhalkova (Russian: Надежда Никитична Михалкова;"
print(re.findall(r'(?<=:\s)\W+(?=;)', w, re.ASCII))
Output
['蔣緯國', '蒋纬国', 'Надежда Никитична Михалкова']
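To see what the flag changes, here is a small side-by-side sketch that runs the same pattern with and without re.ASCII. The names pattern, unicode_matches and ascii_matches are only for illustration.

```python
import re

w = "Chiang Wei-kuo (traditional Chinese: 蔣緯國; simplified Chinese: 蒋纬国; pinyin: Jiǎng Wěiguó, or Wego Chiang; and Nadezhda Nikitichna Mikhalkova (Russian: Надежда Никитична Михалкова;"

pattern = r'(?<=:\s)\W+(?=;)'

# Default (Unicode) mode: Chinese and Cyrillic letters are word characters,
# so \W cannot match them and nothing is found.
unicode_matches = re.findall(pattern, w)

# ASCII mode: only [a-zA-Z0-9_] are word characters, so \W matches the
# non-ASCII letters (and the spaces between the Russian words).
ascii_matches = re.findall(pattern, w, re.ASCII)

print(unicode_matches)
print(ascii_matches)
```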
CodePudding user response:
You may use:
>>> re.findall(r'(?<=:\s)[^\s\dA-Za-z_][^;]*(?=;)', w)
['蔣緯國', '蒋纬国', 'Надежда Никитична Михалкова']
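In Python 3, \W is Unicode-aware by default: it matches only characters that are not Unicode word characters. Chinese and Cyrillic letters count as word characters, so \W never matches them unless re.ASCII is set.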
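If the character-class pattern is hard to read, it can be written out piece by piece with re.VERBOSE. This is just a sketch of the same regex with comments added; the names pattern and matches are hypothetical.

```python
import re

w = "Chiang Wei-kuo (traditional Chinese: 蔣緯國; simplified Chinese: 蒋纬国; pinyin: Jiǎng Wěiguó, or Wego Chiang; and Nadezhda Nikitichna Mikhalkova (Russian: Надежда Никитична Михалкова;"

pattern = re.compile(r"""
    (?<=:\s)          # must be preceded by a colon and one whitespace character
    [^\s\dA-Za-z_]    # first character: not whitespace, a digit, an ASCII letter or underscore
    [^;]*             # then everything up to, but not including, the next semicolon
    (?=;)             # must be followed by a semicolon
""", re.VERBOSE)

matches = pattern.findall(w)
print(matches)
```

The first character class is what rejects the pinyin part: "Jiǎng" starts with an ASCII letter, so the match cannot begin there.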