I want to count the Chinese characters in each string while excluding certain characters. For example, given
s_l = ['康熙十年','咸丰三年','民国二十二年']
I need to exclude the ‘年’ character, so I currently do:
import re
s_l = ['康熙十年','咸丰三年','民国二十二年']
for str_item in s_l:
    # count all characters in the CJK range, then subtract the occurrences of 年
    res = len(re.findall(r'[\u4E00-\u9FFF]', str_item)) - len(re.findall(r'[年]', str_item))
    print(res)
Can I combine these two regexes into one, and if so, how? Combining them directly like this does not give the right count:
re.findall(r'[\u4E00-\u9FFF]((?![年]).)*$', str_item)
CodePudding user response:
Without regex:
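# sample input, inferred from the output shown below
s_l = ['abc)def', 'aaabbbccc', 'dfg']
exclude_list = list('?!.)ab')
for str_item in s_l:
    # keep only the characters that are not in the exclusion list
    res = len([i for i in str_item if i not in exclude_list])
    print(f"{str_item}: {res}")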
Output:
abc)def: 4
aaabbbccc: 3
dfg: 3
With a regex:
import re
for str_item in s_l:
    # a negated character class matches every character not listed in it
    res = len(re.findall(r'[^?!.)ab]', str_item))
    print(res)
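Applied to the data from the question, the non-regex approach could look like the sketch below. It assumes the \u4E00-\u9FFF range used in the question is enough to identify Chinese characters; the helper name count_chinese and the EXCLUDE set are hypothetical, introduced only for illustration.

```python
s_l = ['康熙十年', '咸丰三年', '民国二十二年']
EXCLUDE = set('年')  # characters to leave out of the count (assumption: only 年)

def count_chinese(text, exclude=EXCLUDE):
    """Count characters in U+4E00..U+9FFF that are not in `exclude`."""
    return sum(
        1 for ch in text
        if '\u4e00' <= ch <= '\u9fff' and ch not in exclude
    )

counts = [count_chinese(s) for s in s_l]
print(counts)  # [3, 3, 5]
```

Using a set for the excluded characters keeps the membership test cheap if the exclusion list grows.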
CodePudding user response:
This becomes much simpler if you install the third-party regex module (pip install regex)
and then use
import regex
s_l = ['康熙十年','咸丰三年','民国二十二年', 'abc']
rx = regex.compile(r'[^\P{Han}年]')
print( [len(rx.findall(s)) for s in s_l] )
# => [3, 3, 5, 0]
See the Python demo and the regex demo. The [^\P{Han}年] regex matches any Chinese character other than 年.
The re-compliant pattern is
(?!\u5E74)[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFA6D\uFA70-\uFAD9\U00016FE2\U00016FE3\U00016FF0\U00016FF1\U00020000-\U0002A6DF\U0002A700-\U0002B738\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002F800-\U0002FA1D\U00030000-\U0003134A]
See the regex demo. See the Python demo:
import re
s_l = ['康熙十年','咸丰三年','民国二十二年', 'abc']
rx = re.compile(r'(?!\u5E74)[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFA6D\uFA70-\uFAD9\U00016FE2\U00016FE3\U00016FF0\U00016FF1\U00020000-\U0002A6DF\U0002A700-\U0002B738\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002F800-\U0002FA1D\U00030000-\U0003134A]')
print( [len(rx.findall(s)) for s in s_l] )
# => [3, 3, 5, 0]
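If the narrower \u4E00-\u9FFF range from the question is sufficient, the same lookahead idea gives a much shorter re pattern. The sketch below shows that, plus an alternative that splits the range around U+5E74 (年) so the character is never part of the class. The variable names are illustrative only, and both patterns assume that only 年 needs to be excluded.

```python
import re

s_l = ['康熙十年', '咸丰三年', '民国二十二年', 'abc']

# Option 1: the negative lookahead rejects 年 before the class consumes a character.
rx_lookahead = re.compile(r'(?!年)[\u4E00-\u9FFF]')

# Option 2: split the range around U+5E74 (年) so it is never in the class.
rx_split = re.compile(r'[\u4E00-\u5E73\u5E75-\u9FFF]')

counts_lookahead = [len(rx_lookahead.findall(s)) for s in s_l]
counts_split = [len(rx_split.findall(s)) for s in s_l]
print(counts_lookahead)  # [3, 3, 5, 0]
print(counts_split)      # [3, 3, 5, 0]
```

The lookahead form is easier to extend to several excluded characters, e.g. (?![年月日]), whereas the split-range form needs one extra gap per excluded character.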