I am trying to identify and replace unicode characters from strings that I am processing to make keyword match filters.
For example, given the string
"Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"
I want the output from when I use the re.sub function (replace the pattern with blank space " ") to be
"Apple iPhone 12 mini A2176 128GB (PRODUCT) Red! Perfect condition! Unlocked!"
So I went to a regex build and test website and came up with this pattern
\\u[a-z|0-9]{4}
Which captures the 2 unicode strings
\u00a0 and \u00a0
Now, trying to apply it to my Python code, I first tried this snippet. Here I use the findall function to see if the code would return the unicode strings
import re
strin = "Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"
print(re.findall('\\u[a-z|0-9]{4}', strin))
which raises the following error
re.error: incomplete escape \u at position 0
I then tried adding an 'r' in front of the string pattern. No error appears but there is no unicode string returned
print(re.findall(r'\\u[a-z|0-9]{4}', strin))
output is an empty list []
I then tried the same 2 approaches but with only 1 backslash
print(re.findall('\u[a-z|0-9]{4}', strin))
gives SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
print(re.findall(r'\u[a-z|0-9]{4}', strin)) gives
re.error: incomplete escape \u at position 0
CodePudding user response:
You have multiple misunderstandings here (all of them common FAQs).
The argument to re.findall is a string. In an ordinary Python string literal, a literal backslash has to be escaped by doubling it, so '\\u[a-z|0-9]{4}' reaches the regex engine as \u[a-z|0-9]{4}, with a single backslash. A better solution is to use the r"..." raw string notation, especially for regular expressions, which often need to contain literal backslashes of their own.
The error message you get from findall tells you that the character escape \u[ is incomplete, because the regex engine expects four hexadecimal digits after \u and [ is not a hexadecimal digit. (In fact, even if your regex weren't syntactically incorrect, it matches way too much; the regex for the text of a Unicode character escape in Python would be \\u[0-9a-f]{4}, not a-z.)
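To see what the regex engine actually receives, it can help to look at the pattern strings themselves. The following is a minimal sketch; the variable names are only for illustration.

```python
import re

plain = '\\u[a-z|0-9]{4}'   # ordinary literal: Python collapses \\ to one backslash
raw = r'\\u[a-z|0-9]{4}'    # raw literal: both backslashes are kept

print(plain)  # \u[a-z|0-9]{4}
print(raw)    # \\u[a-z|0-9]{4}

# The regex engine rejects the first pattern: \u must be followed by hex digits.
try:
    re.compile(plain)
    plain_error = None
except re.error as exc:
    plain_error = str(exc)
print(plain_error)

# The second pattern compiles; it looks for a literal backslash followed by "u".
compiled_raw = re.compile(raw)
```

The raw pattern is valid, but it only finds something if the input really contains a backslash character.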
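The character \u00a0 is a single Unicode code point (U+00A0, NO-BREAK SPACE), stored as a single character in the string. Python converts the escape when it parses the string literal, so the string contains no backslash and no u for a pattern like yours to find; that is also why your raw-string attempt returned an empty list. What you can match is e.g.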
re.findall(r'[\u0080-\uffef]', strin)
which contains a character class covering the range of non-ASCII characters in the Unicode Basic Multilingual Plane. (This includes surrogates, which properly speaking we should exclude, but let's not go there for a beginner question. Maybe also note that there are Unicode characters outside the BMP, which can be matched with [\U00010000-\U0010FFFF].)
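As a quick check of the point above, this sketch applies the character class to the string from the question. Replacing every match with a space is a blunt instrument and is shown only to reproduce the output the question asks for.

```python
import re

strin = "Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"

# The escape was converted when Python parsed the literal: no backslash remains.
has_backslash = "\\" in strin
print(has_backslash)  # False

# Each non-ASCII BMP character is matched as one single character.
found = re.findall(r'[\u0080-\uffef]', strin)
print(found)  # ['\xa0', '\xa0']

# Replacing those characters with an ASCII space gives the desired output.
cleaned = re.sub(r'[\u0080-\uffef]', ' ', strin)
print(cleaned)
```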
(Tangentially, notice also that the character class [a-z|0-9] includes the literal character |. The | stands for alternation outside character classes, but inside [...] nearly everything is just a literal character; the exceptions are an initial ^, a - between two characters, the closing ], and the backslash.)
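A small illustration of the difference, using made-up test strings:

```python
import re

# Inside [...] the | is an ordinary character, so it is matched literally.
with_pipe = re.findall(r'[a-z|0-9]', 'a|1')
print(with_pipe)      # ['a', '|', '1']

# Without the |, the class matches only letters and digits.
without_pipe = re.findall(r'[a-z0-9]', 'a|1')
print(without_pipe)   # ['a', '1']

# Outside a character class, | separates alternatives.
alternation = re.findall(r'cat|dog', 'cat dog bird')
print(alternation)    # ['cat', 'dog']
```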
But more fundamentally, the beginner reaction to "I don't understand this Unicode stuff" is wrong; the response should be "I need to understand this stuff", not "I need to remove it". There is rarely a good case for simply removing all Unicode, and the tendency is only dragging you back into the dark ages before Unicode when you could only represent English text (and barely that) in Western computers.
A more principled solution to this specific problem is to canonicalize all whitespace characters (perhaps except tabs) to an ASCII space, and figure out how to tackle other Unicode characters as you bump into them. What makes sense depends hugely on your specific application. For search or NLP, it might make sense to canonicalize or "flatten" all text to a near-ASCII subset, but for many other applications, you usually need something a bit more nuanced.
With that out of the way, try
Python 3.8.2 (default, May 18 2021, 11:47:11)
[Clang 12.0.5 (clang-1205.0.22.9)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> strin = "Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"
>>> import re
>>> re.sub(r'\s', ' ', strin)
'Apple iPhone 12 mini A2176 128GB (PRODUCT) Red! Perfect condition! Unlocked!'
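If you do want to leave tabs alone, as suggested above, one possible sketch is a negated character class. The helper name normalize_spaces and the sample string are hypothetical, chosen only for this example.

```python
import re

def normalize_spaces(text):
    """Replace every whitespace character except tab with an ASCII space."""
    # [^\S\t] reads as "not non-whitespace and not tab",
    # i.e. any whitespace character other than a tab.
    return re.sub(r'[^\S\t]', ' ', text)

sample = "128GB\u00a0(PRODUCT)\tRed!\u2003Perfect"
result = normalize_spaces(sample)
print(repr(result))  # '128GB (PRODUCT)\tRed! Perfect'
```

Note that this also turns newlines into spaces; add \n to the class if they should survive as well.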
CodePudding user response:
If your purpose is just to remove non-ASCII characters from your text, then you are working way too hard. You can do it simply with
strin.encode('ascii', 'ignore').decode('ascii')
You encode your string as ASCII and ignore the errors, then decode it again as ASCII, which removes all the non-ASCII characters. Note that this deletes the characters rather than replacing them with a space.
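A quick check of what this does in practice. The sample string is a shortened, made-up variant of the one in the question, with an accented letter added for illustration.

```python
import re

strin = "128GB\u00a0(PRODUCT) Red!\u00a0Perfect caf\u00e9"

# encode/ignore deletes every non-ASCII character, including the no-break space.
dropped = strin.encode('ascii', 'ignore').decode('ascii')
print(dropped)   # 128GB(PRODUCT) Red!Perfect caf

# Replacing whitespace first keeps the word boundaries intact.
spaced = re.sub(r'\s', ' ', strin).encode('ascii', 'ignore').decode('ascii')
print(spaced)    # 128GB (PRODUCT) Red! Perfect caf
```

For keyword matching, the first result glues neighbouring words together, so converting whitespace before stripping is probably the safer order.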