Some characters in a string are not recognized as letters in Python

Time:04-21

I am reading text from an image using AWS Rekognition. The text is extracted correctly, but I then use a regular expression to check its format, e.g. the first 5 characters must be letters and the next 4 must be digits. Sometimes the regex fails to recognize the characters as letters. Below is sample data:

import re
str = "СТЕРК7383"
re.match(r"[A-Z]{5}\d{4}", str)

This returns None, i.e. no match is found.

If I call lower() on this string, it returns something unexpected:

str= "СТЕРК7383"
str.lower()
  'стерк7383'

Let me know how to solve this problem.

CodePudding user response:

It's because your characters are not ASCII (you can see this when you lowercase the string).

Using \w instead of [A-Z] makes the pattern match. From the documentation:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.

import re

one_str = "СТЕРК7383"
res = re.match(r"\w{5}\d{4}", one_str)
print(res)

Also, str is a built-in name, so you should not use it as a variable name.
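
One caveat worth noting: as the quoted documentation says, \w also matches digits and the underscore, so \w{5} will accept things like "12345". If the first five characters really must be letters, a common idiom is [^\W\d_] (word characters minus digits and underscore). This is only a sketch, and the names LOOSE and LETTERS_THEN_DIGITS are made up for illustration. The class approximates "Unicode letters" and may still admit a few unusual numeric characters that \d does not cover.

```python
import re

# \w{5} accepts letters, digits and underscores alike.
LOOSE = re.compile(r"\w{5}\d{4}")

# [^\W\d_] = "a word character that is not a digit and not an underscore",
# which approximates "any Unicode letter".
LETTERS_THEN_DIGITS = re.compile(r"[^\W\d_]{5}\d{4}")

samples = ["СТЕРК7383", "123457383", "AB_CD7383"]
for sample in samples:
    loose_ok = LOOSE.fullmatch(sample) is not None
    strict_ok = LETTERS_THEN_DIGITS.fullmatch(sample) is not None
    print(sample, loose_ok, strict_ok)
```

fullmatch is used here instead of match so that trailing characters after the four digits are rejected as well.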

CodePudding user response:

[A-Z] matches only the 26 uppercase letters of the ASCII (Latin) alphabet. Since your string contains Cyrillic characters that merely look like Latin ones, the pattern does not match.
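
If you want to confirm which characters you actually received, the standard unicodedata module can print the code point and official name of each one. A minimal sketch, assuming all five letters in the sample are Cyrillic (the string is written with escapes so the code points are unambiguous, and describe is just an illustrative helper name):

```python
import unicodedata

# Assumed to be the same text as "СТЕРК7383", spelled out as Cyrillic code points.
text = "\u0421\u0422\u0415\u0420\u041a7383"

def describe(s):
    """Return (code point, Unicode name) pairs for each character in s."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in s]

for code_point, name in describe(text):
    print(code_point, name)

# None of the first five characters fall in the ASCII range A-Z.
print(any("A" <= ch <= "Z" for ch in text[:5]))
```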

To match any Unicode word character, use \w instead:

>>> import re
>>> re.match(r"\w{5}\d{4}", "СТЕРК7383")
<re.Match object; span=(0, 9), match='СТЕРК7383'>
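
If the image really contains Latin letters and the OCR step is just returning look-alike Cyrillic ones, another option is to map those look-alikes back to Latin before validating, so the original [A-Z] pattern keeps working. This is only a sketch under that assumption: the CYRILLIC_TO_LATIN table and normalize_ocr helper are hypothetical, the table covers just a handful of capitals, and it would corrupt text that is genuinely Cyrillic.

```python
import re

# Hypothetical, incomplete table of Cyrillic capitals that resemble Latin capitals.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0410": "A", "\u0412": "B", "\u0415": "E", "\u041a": "K",
    "\u041c": "M", "\u041d": "H", "\u041e": "O", "\u0420": "P",
    "\u0421": "C", "\u0422": "T", "\u0423": "Y", "\u0425": "X",
})

def normalize_ocr(text):
    """Replace look-alike Cyrillic capitals with their Latin counterparts."""
    return text.translate(CYRILLIC_TO_LATIN)

# Assumed to be the sample string from the question, written as Cyrillic code points.
ocr_text = "\u0421\u0422\u0415\u0420\u041a7383"
cleaned = normalize_ocr(ocr_text)

print(cleaned)
print(re.fullmatch(r"[A-Z]{5}\d{4}", cleaned) is not None)
```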