Is there a better regex to count Chinese characters while excluding some characters?

I want to count the Chinese characters in each string while excluding certain characters. For example, given

s_l = ['康熙十年','咸丰三年','民国二十二年']

I need to exclude the ‘年’ character, so I currently use:

import re

s_l = ['康熙十年','咸丰三年','民国二十二年']
for str_item in s_l:
    # Count all characters in the CJK range, then subtract the occurrences of 年
    res = len(re.findall(r'[\u4E00-\u9FFF]', str_item))-len(re.findall(r'[年]', str_item))
    print(res)

Can I combine these two regexes into one? If so, how? Combining them directly, as in the following attempt, does not work:

re.findall(r'[\u4E00-\u9FFF]((?![年]).)*$', str_item)

CodePudding user response：

Without regex:

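# Sample strings (matching the output shown below)
s_l = ['abc)def', 'aaabbbccc', 'dfg']
exclude_list = list('?!.)ab')
for str_item in s_l:
    # Keep only the characters that are not in the exclusion list
    res = len([i for i in str_item if i not in exclude_list])
    print(f"{str_item}: {res}")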

Output:

abc)def: 4
aaabbbccc: 3
dfg: 3

With a regex:

import re

for str_item in s_l:
    # The negated character class matches every character except the excluded ones
    res = len(re.findall(r'[^?!.)ab]', str_item))
    print(res)
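
Applied to the strings from the question, the non-regex idea might look like the sketch below. Note that a negated class or exclusion list on its own also counts non-Chinese characters, so this sketch keeps a range check. The helper name count_chinese is hypothetical, and the range U+4E00 to U+9FFF is simply the one used in the question, not a complete definition of "Chinese character".

```python
s_l = ['康熙十年', '咸丰三年', '民国二十二年']
exclude = set('年')  # characters that should not be counted

def count_chinese(text, exclude=exclude):
    # Hypothetical helper: count characters in U+4E00..U+9FFF
    # (the range from the question), skipping the excluded ones.
    return sum(
        1 for ch in text
        if '\u4e00' <= ch <= '\u9fff' and ch not in exclude
    )

counts = [count_chinese(s) for s in s_l]
print(counts)  # [3, 3, 5]
```

To exclude more characters, add them to the set, e.g. set('年月日').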

CodePudding user response：

It becomes much simpler if you install the third-party regex module (pip install regex) and then use

import regex
s_l = ['康熙十年','咸丰三年','民国二十二年', 'abc']
rx = regex.compile(r'[^\P{Han}年]')
print( [len(rx.findall(s)) for s in s_l] )
# => [3, 3, 5, 0]

Here \P{Han} matches any character that is not in the Han script, so the negated class [^\P{Han}年] matches any character that is neither non-Han nor 年, i.e. any Chinese character other than 年.

An equivalent pattern that works with the built-in re module is

(?!\u5E74)[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFA6D\uFA70-\uFAD9\U00016FE2\U00016FE3\U00016FF0\U00016FF1\U00020000-\U0002A6DF\U0002A700-\U0002B738\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002F800-\U0002FA1D\U00030000-\U0003134A]

The (?!\u5E74) lookahead rejects 年 before the character class is tried. In Python:

import re
s_l = ['康熙十年','咸丰三年','民国二十二年', 'abc']
rx = re.compile(r'(?!\u5E74)[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFA6D\uFA70-\uFAD9\U00016FE2\U00016FE3\U00016FF0\U00016FF1\U00020000-\U0002A6DF\U0002A700-\U0002B738\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002F800-\U0002FA1D\U00030000-\U0003134A]')
print( [len(rx.findall(s)) for s in s_l] )
# => [3, 3, 5, 0]
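
If only the range from the question (U+4E00 to U+9FFF) matters, the same lookahead idea can be written much more compactly. The sketch below shows it next to an alternative that cuts 年 (U+5E74) out of the range. It assumes the shorter range is sufficient for the data, and the variable names are only illustrative.

```python
import re

s_l = ['康熙十年', '咸丰三年', '民国二十二年', 'abc']

# Lookahead form: reject 年 first, then match one character in the range.
lookahead_rx = re.compile(r'(?!年)[\u4E00-\u9FFF]')
# Split-range form: the same range with U+5E74 (年) cut out.
split_rx = re.compile(r'[\u4E00-\u5E73\u5E75-\u9FFF]')

lookahead_counts = [len(lookahead_rx.findall(s)) for s in s_l]
split_counts = [len(split_rx.findall(s)) for s in s_l]
print(lookahead_counts)  # [3, 3, 5, 0]
print(split_counts)      # [3, 3, 5, 0]
```

The lookahead form is easier to extend: to exclude several characters, list them in a class inside the lookahead, e.g. (?![年月日]).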