Is there better regex to calculate the number of Chinese letters and exclude some characters at the

Time:05-12

I want to count the Chinese characters in each string while excluding certain characters. For example, given

s_l = ['康熙十年','咸丰三年','民国二十二年']

I need to exclude the ‘年’ character, so I currently do this:

import re

s_l = ['康熙十年','咸丰三年','民国二十二年']
for idx, str_item in enumerate(s_l):
    # count every character in the CJK range, then subtract the occurrences of 年
    res = len(re.findall(r'[\u4E00-\u9FFF]', str_item)) - len(re.findall(r'[年]', str_item))
    print(res)

Can I combine these two regexes into one? If so, how? My attempt at combining them directly does not work:

re.findall(r'[\u4E00-\u9FFF]((?![年]).)*$', str_item)

CodePudding user response:

Without regex:

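s_l = ['abc)def', 'aaabbbccc', 'dfg']  # sample input implied by the output shown for this answer
exclude_list = list('?!.)ab')
for str_item in s_l:
    # keep only the characters that are not in the exclusion list
    res = len([i for i in str_item if i not in exclude_list])
    print(f"{str_item}: {res}")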

Output:

abc)def: 4
aaabbbccc: 3
dfg: 3

With a regex:

import re

for str_item in s_l:
    # a negated character class matches every character that is not excluded
    res = len(re.findall(r'[^?!.)ab]', str_item))
    print(res)
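
For the strings in the question, the same non-regex idea can be applied by keeping only characters inside the range the question uses (\u4E00-\u9FFF) and skipping the excluded ones. This is a minimal sketch: the helper name count_cjk and the EXCLUDED set are made up for illustration, and the range covers only the basic CJK Unified Ideographs block, so Han characters outside it are not counted.

```python
# Hypothetical helper: counts characters in the basic CJK Unified Ideographs
# block (U+4E00..U+9FFF, the range used in the question), skipping excluded ones.
EXCLUDED = {'年'}

def count_cjk(text, excluded=EXCLUDED):
    return sum(
        1
        for ch in text
        if '\u4e00' <= ch <= '\u9fff' and ch not in excluded
    )

s_l = ['康熙十年', '咸丰三年', '民国二十二年']
counts = [count_cjk(s) for s in s_l]
print(counts)  # [3, 3, 5]
```

Using a set for the exclusions keeps the membership test cheap if more characters need to be skipped later.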

CodePudding user response:

It will become much simpler if you pip install regex and then use

import regex
s_l = ['康熙十年','咸丰三年','民国二十二年', 'abc']
rx = regex.compile(r'[^\P{Han}年]')
print( [len(rx.findall(s)) for s in s_l] )
# => [3, 3, 5, 0]

The [^\P{Han}年] pattern is a negated character class: \P{Han} matches any character that is not a Han character, so the class as a whole matches any Han (Chinese) character other than 年.

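A pattern that works with the built-in re module uses a negative lookahead to exclude 年 (\u5E74) and spells out the Han ranges explicitly: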

(?!\u5E74)[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFA6D\uFA70-\uFAD9\U00016FE2\U00016FE3\U00016FF0\U00016FF1\U00020000-\U0002A6DF\U0002A700-\U0002B738\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002F800-\U0002FA1D\U00030000-\U0003134A]

Usage in Python:

import re
s_l = ['康熙十年','咸丰三年','民国二十二年', 'abc']
rx = re.compile(r'(?!\u5E74)[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFA6D\uFA70-\uFAD9\U00016FE2\U00016FE3\U00016FF0\U00016FF1\U00020000-\U0002A6DF\U0002A700-\U0002B738\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002F800-\U0002FA1D\U00030000-\U0003134A]')
print( [len(rx.findall(s)) for s in s_l] )
# => [3, 3, 5, 0]
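
If the single block from the question (\u4E00-\u9FFF) is enough, the same lookahead idea gives a much shorter pattern for re. The sketch below assumes that narrower range, so it will miss Han characters outside it (extension blocks, compatibility ideographs); the variable names are illustrative only. It also shows an alternative without a lookahead, which splits the range around U+5E74 (年).

```python
import re

s_l = ['康熙十年', '咸丰三年', '民国二十二年', 'abc']

# The negative lookahead rejects 年 (U+5E74) before the class consumes a character.
rx_lookahead = re.compile(r'(?!年)[\u4E00-\u9FFF]')

# The same set of characters written as two ranges that skip U+5E74.
rx_split = re.compile(r'[\u4E00-\u5E73\u5E75-\u9FFF]')

counts_lookahead = [len(rx_lookahead.findall(s)) for s in s_l]
counts_split = [len(rx_split.findall(s)) for s in s_l]
print(counts_lookahead)  # [3, 3, 5, 0]
print(counts_split)      # [3, 3, 5, 0]
```

The lookahead form is easier to extend to several excluded characters, for example (?![年月日]), while the split-range form becomes unwieldy beyond one or two exclusions.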