Python regular expression for non-latin characters not working

I have some sentences like the following

w = "Chiang Wei-kuo (traditional Chinese: 蔣緯國; simplified Chinese: 蒋纬国; pinyin: Jiǎng Wěiguó, or Wego Chiang; and Nadezhda Nikitichna Mikhalkova (Russian: Надежда Никитична Михалкова;"

and I am using the following regular expression to capture non-latin words (蔣緯國, 蒋纬国, Надежда Никитична Михалкова)

for match in re.finditer(r'(?<=:\s)\W+(?=;)', w):
    print(match[0])

So I am trying to capture any non-word characters (\W) between the : symbol and the ; symbol, but it doesn't seem to be working. I also tried substituting \W with [^a-zA-Z0-9_], but it still doesn't work. Any help on this?

CodePudding user response：

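You could use the re.ASCII flag. It restricts \w to [a-zA-Z0-9_], so \W matches everything else, including Chinese and Cyrillic characters.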

import re

w = "Chiang Wei-kuo (traditional Chinese: 蔣緯國; simplified Chinese: 蒋纬国; pinyin: Jiǎng Wěiguó, or Wego Chiang; and Nadezhda Nikitichna Mikhalkova (Russian: Надежда Никитична Михалкова;"
print(re.findall(r'(?<=:\s)\W+(?=;)', w, re.ASCII))

Output

['蔣緯國', '蒋纬国', 'Надежда Никитична Михалкова']
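
To see what the flag changes, here is a small side-by-side sketch that runs the same pattern with and without re.ASCII. The variable names (pattern, unicode_matches, ascii_matches) are just illustrative, not from the question.

```python
import re

w = ("Chiang Wei-kuo (traditional Chinese: 蔣緯國; simplified Chinese: 蒋纬国; "
     "pinyin: Jiǎng Wěiguó, or Wego Chiang; and Nadezhda Nikitichna Mikhalkova "
     "(Russian: Надежда Никитична Михалкова;")

pattern = r'(?<=:\s)\W+(?=;)'

# Default (Unicode) mode: CJK and Cyrillic letters are word characters,
# so \W+ cannot start matching right after ": ".
unicode_matches = re.findall(pattern, w)

# ASCII mode: only [a-zA-Z0-9_] are word characters, so \W+ can consume
# the Chinese and Russian text up to the semicolon.
ascii_matches = re.findall(pattern, w, re.ASCII)

print(unicode_matches)
print(ascii_matches)
```

The pinyin entry is skipped in both modes because it starts with the ASCII letter J, which is a word character either way.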

CodePudding user response：

You may use:

>>> re.findall(r'(?<=:\s)[^\s\dA-Za-z_][^;]*(?=;)', w)

['蔣緯國', '蒋纬国', 'Надежда Никитична Михалкова']

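In Python 3, \W is Unicode-aware by default for str patterns: Chinese and Cyrillic letters count as word characters, so \W does not match them. The class [^\s\dA-Za-z_] instead requires the first character after the colon to be something other than whitespace, a digit, an ASCII letter, or an underscore, and [^;]* then consumes everything up to the next semicolon.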
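
As a quick check of that behaviour, the sketch below applies a bare \W to a short made-up string (sample is a hypothetical test value, not from the question) in both modes.

```python
import re

# One CJK character, one Cyrillic letter, one ASCII letter, one semicolon.
sample = "蔣Нa;"

# Default (Unicode) mode: only the semicolon is a non-word character.
default_non_word = re.findall(r'\W', sample)

# ASCII mode: everything outside [a-zA-Z0-9_] is a non-word character.
ascii_non_word = re.findall(r'\W', sample, re.ASCII)

print(default_non_word)
print(ascii_non_word)
```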