I am reading text from an image using AWS Rekognition. The text is extracted correctly, but I use a regular expression to check its format, e.g. the first 5 characters should be letters and the next 4 should be digits. Sometimes it fails to recognize the characters as letters. Below is the sample data:
import re
str= "СТЕРК7383"
re.match("[A-Z]{5}\d{4}",str)
This returns None (no match).
If I call lower() on this string, it returns something unexpected:
str= "СТЕРК7383"
str.lower()
'стерк7383'
Let me know how to solve this problem.
CodePudding user response:
It's because your characters are not ASCII (you can see it when you lowercase the string).
Using \w instead of [A-Z] works better. From the documentation:
Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
import re
one_str = "СТЕРК7383"
res = re.match(r"\w{5}\d{4}", one_str)
print(res)
Also, str is a built-in name; you should not use it as a variable name.
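One caveat: \w also matches digits and the underscore, so \w{5}\d{4} would accept a string such as "123456789". A minimal sketch, assuming the goal is exactly five letters (in any alphabet) followed by four digits, is to use the character class [^\W\d_] instead. The names LETTERS_THEN_DIGITS and has_expected_format are made up for illustration.

```python
import re

# [^\W\d_] = word characters minus decimal digits and the underscore,
# which is essentially "a letter in any script".
# fullmatch() additionally rejects strings with extra trailing characters.
LETTERS_THEN_DIGITS = re.compile(r"[^\W\d_]{5}\d{4}")

def has_expected_format(text):
    """Return True if text is five letters followed by four digits."""
    return LETTERS_THEN_DIGITS.fullmatch(text) is not None

print(has_expected_format("СТЕРК7383"))  # Cyrillic letters + digits
print(has_expected_format("ABCDE1234"))  # Latin letters + digits
print(has_expected_format("123456789"))  # digits only
```

Whether fullmatch() or match() is the right call depends on whether the text may be followed by other characters.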
CodePudding user response:
A-Z will of course only match uppercase letters from the Latin alphabet. Since your string contains Cyrillic characters, it won't match anything.
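To check which characters in a string are not Latin, one option is to print each character's Unicode name. This is a small sketch using the standard unicodedata module; the helper name describe_chars is hypothetical.

```python
import unicodedata

def describe_chars(text):
    """Pair each character with its Unicode name (or "UNKNOWN" if unnamed)."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in text]

# Print one line per character so look-alike letters become visible.
for ch, name in describe_chars("СТЕРК7383"):
    print(repr(ch), name)
```

Letters that look identical on screen show up with different names here, which makes the mismatch with [A-Z] easy to spot.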
To match any Unicode word character, use \w instead:
>>> import re
>>> re.match(r"\w{5}\d{4}", "СТЕРК7383")
<re.Match object; span=(0, 9), match='СТЕРК7383'>
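If the value is actually expected to consist of Latin letters and the Cyrillic characters are just look-alikes produced by the text recognition, another option is to map them back before validating. The sketch below is only an assumption about the data: the table HOMOGLYPHS and the function to_latin_lookalikes are hypothetical, and the table is deliberately incomplete.

```python
import re

# Hypothetical, incomplete table of Cyrillic capitals that look like Latin ones.
# Escapes are used so the two alphabets cannot be confused in the source code.
HOMOGLYPHS = str.maketrans({
    "\u0410": "A", "\u0412": "B", "\u0415": "E", "\u041a": "K",
    "\u041c": "M", "\u041d": "H", "\u041e": "O", "\u0420": "P",
    "\u0421": "C", "\u0422": "T", "\u0425": "X",
})

def to_latin_lookalikes(text):
    """Replace known Cyrillic look-alikes with their Latin counterparts."""
    return text.translate(HOMOGLYPHS)

cleaned = to_latin_lookalikes("СТЕРК7383")
print(cleaned)
# With re.ASCII, \d is limited to 0-9, matching the original strict intent.
print(re.fullmatch(r"[A-Z]{5}\d{4}", cleaned, flags=re.ASCII))
```

This keeps the original strict [A-Z] check, at the cost of silently rewriting input, so it only makes sense when Cyrillic text is never a legitimate value.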