Some characters are not identified as letters in a Python string

I am reading text from an image using AWS Rekognition. The text is extracted correctly, and I then use a regular expression to check its format, e.g. the first 5 characters should be letters and the next 4 should be digits. But sometimes the characters are not recognized as letters. Below is some sample data:

import re
str= "СТЕРК7383"
re.match(r"[A-Z]{5}\d{4}", str)

This returns None (no match).

If I call lower() on this string, it returns something unexpected:

str= "СТЕРК7383"
str.lower()
  'стерк7383'

How can I solve this problem?

CodePudding user response：

It's because your characters are not ASCII: they are Cyrillic letters that look like Latin ones (you can see it when you lowercase the string).
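One way to confirm this is to print the code point and Unicode name of each character. The sketch below is only illustrative: the helper name describe is made up for this example, and the sample string is written with escapes so the code points are unambiguous.

```python
import unicodedata

# The sample string from the question, spelled out with escapes.
text = "\u0421\u0422\u0415\u0420\u041a7383"

def describe(s):
    """Return (character, code point, Unicode name) for each character in s."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in s]

for ch, code_point, name in describe(text):
    print(ch, code_point, name)

# False: the string contains characters outside the ASCII range.
print(text.isascii())
```

The first five names start with CYRILLIC CAPITAL LETTER, which is why [A-Z] does not match them.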

Using \w instead of [A-Z] makes the pattern match, although \w also accepts digits and the underscore. From the documentation:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.

import re

one_str = "СТЕРК7383"
res = re.match(r"\w{5}\d{4}", one_str)
print(res)

Also, str is a built-in name; you should not use it as a variable name.
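Because \w also accepts digits and the underscore, \w{5}\d{4} would accept a string of nine digits as well. If the first five characters must be letters, a stricter pattern may be closer to what you want. This is a minimal sketch, assuming letters from any script (upper or lower case) are acceptable; the names LETTERS_THEN_DIGITS and is_valid are made up for this example.

```python
import re

# [^\W\d_] reads as "a word character that is neither a digit nor an underscore",
# which leaves letters only (from any script, upper or lower case).
LETTERS_THEN_DIGITS = re.compile(r"[^\W\d_]{5}\d{4}")

def is_valid(code):
    """True if code is exactly 5 letters followed by 4 digits."""
    return LETTERS_THEN_DIGITS.fullmatch(code) is not None

sample = "\u0421\u0422\u0415\u0420\u041a7383"  # the string from the question

print(is_valid(sample))       # True
print(is_valid("ABCDE1234"))  # True
print(is_valid("123456789"))  # False: the first five characters are digits

# For comparison, the looser pattern accepts the all-digit string.
print(re.match(r"\w{5}\d{4}", "123456789") is not None)  # True
```

fullmatch is used here so that trailing characters after the four digits are rejected too; re.match only anchors at the start of the string.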

CodePudding user response：

[A-Z] only matches uppercase letters from the basic Latin (ASCII) alphabet. Since your string contains Cyrillic characters, the pattern does not match.

To match any Unicode word character, use \w instead:

>>> import re
>>> re.match(r"\w{5}\d{4}", "СТЕРК7383")
<re.Match object; span=(0, 9), match='СТЕРК7383'>
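If the codes are really supposed to contain Latin letters and the text recognition is returning Cyrillic lookalikes, another option is to map the lookalikes back to Latin before validating with the original [A-Z] pattern. This is only a sketch under that assumption: the table CYRILLIC_TO_LATIN and the function normalize are hypothetical, and the table covers just a handful of common lookalikes, not every possible one.

```python
import re

# Hypothetical lookup table: Cyrillic capitals that resemble Latin capitals.
# Assumption: only these lookalikes occur; extend the table as needed.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0412": "B",  # CYRILLIC CAPITAL LETTER VE
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u041a": "K",  # CYRILLIC CAPITAL LETTER KA
    "\u041c": "M",  # CYRILLIC CAPITAL LETTER EM
    "\u041d": "H",  # CYRILLIC CAPITAL LETTER EN
    "\u041e": "O",  # CYRILLIC CAPITAL LETTER O
    "\u0420": "P",  # CYRILLIC CAPITAL LETTER ER
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
    "\u0422": "T",  # CYRILLIC CAPITAL LETTER TE
    "\u0425": "X",  # CYRILLIC CAPITAL LETTER HA
})

def normalize(code):
    """Replace known Cyrillic lookalikes with their Latin counterparts."""
    return code.translate(CYRILLIC_TO_LATIN)

raw = "\u0421\u0422\u0415\u0420\u041a7383"  # the string from the question
fixed = normalize(raw)

print(fixed)                                               # CTEPK7383
print(re.fullmatch(r"[A-Z]{5}\d{4}", raw) is not None)     # False
print(re.fullmatch(r"[A-Z]{5}\d{4}", fixed) is not None)   # True
```

Whether this is appropriate depends on your data: if genuine Cyrillic text can appear in the images, translating it to Latin would hide that, and matching with \w is the safer choice.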