Python regex module "re" match unicode characters with \u

I am trying to identify and replace unicode characters from strings that I am processing to make keyword match filters.

For example, given the string

"Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"

I want to use the re.sub function to replace each match of the pattern with a blank space " ", so that the output is

"Apple iPhone 12 mini A2176 128GB (PRODUCT) Red! Perfect condition! Unlocked!"

So I went to a regex build and test website and came up with this pattern

\\u[a-z|0-9]{4}

Which captures the 2 unicode strings

\u00a0 and \u00a0

Now trying to apply it to my python code I first tried this snippet. Here I use the findall function to see if the code would return the unicode strings

import re

strin = "Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"


print(re.findall('\\u[a-z|0-9]{4}', strin))

which raises the following error

re.error: incomplete escape \u at position 0

I then tried adding an 'r' in front of the string pattern. No error appears but there is no unicode string returned

print(re.findall(r'\\u[a-z|0-9]{4}', strin))

The output is an empty list, []. I then tried the same 2 approaches but with only 1 backslash

print(re.findall('\u[a-z|0-9]{4}', strin)) gives SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape

print(re.findall(r'\u[a-z|0-9]{4}', strin)) gives 
re.error: incomplete escape \u at position 0

CodePudding user response：

You have multiple misunderstandings here (all of which are common FAQs).

The argument to re.findall is a string. In an ordinary Python string literal, a literal backslash has to be written as two backslashes. A better solution is to use the r"..." raw string notation, especially for regular expressions, which often need to contain backslashes of their own.
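To make the two layers of escaping concrete, here is a small sketch (the variable names are mine, not from the question) that shows what the regex engine actually receives for each spelling of the pattern:

```python
import re

# What Python hands to the regex engine for each spelling of the pattern.
plain = '\\u[0-9a-f]{4}'   # normal literal: the doubled backslash collapses to one
raw = r'\\u[0-9a-f]{4}'    # raw literal: both backslashes survive

print(len(plain), plain)
print(len(raw), raw)

# The raw form is a valid regex: a literal backslash, then "u", then 4 hex digits.
compiled = re.compile(raw)

# The plain form reaches the engine as \u[0-9..., which it rejects.
try:
    re.compile(plain)
    plain_error = None
except re.error as exc:
    plain_error = str(exc)
print(plain_error)
```

So the raw pattern compiles, but it looks for a real backslash in the text, which is why it finds nothing in a string whose \u00a0 has already been turned into a single character.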

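The error message you get from findall tells you that the escape \u is incomplete: the regex engine expects four hexadecimal digits after \u, and [ is not one. With the raw string r'\\u...' the pattern is valid, but it searches for a literal backslash followed by u, which your string does not contain, hence the empty list. (In fact, even if your regex weren't syntactically incorrect, it matches way too much; the regex for a Unicode character escape in Python would be \\u[0-9a-f]{4}, not a-z.)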
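For completeness, a pattern like this does work when the text really contains the six characters of an undecoded escape. The following is only a sketch under that assumption; the undecoded sample is made up with a raw string, and I also allow uppercase hex digits:

```python
import re

# Hypothetical input in which the escapes were never decoded: the text really
# contains a backslash, a "u", and four hex digits.
undecoded = r"128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition!"

# Literal backslash, "u", then exactly four hex digits in either case.
escape_pattern = re.compile(r'\\u[0-9a-fA-F]{4}')

found = escape_pattern.findall(undecoded)
print(found)

cleaned = escape_pattern.sub(' ', undecoded)
print(cleaned)

# On an ordinary (decoded) string there is no backslash left to find.
decoded = "128GB\u00a0(PRODUCT)"
print(escape_pattern.findall(decoded))
```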

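The escape \u00a0 in a string literal denotes a single Unicode code point (U+00A0, NO-BREAK SPACE), so at that position the string contains one character, not a backslash followed by five more characters. You can't match it with a regex like that. What you can match is e.g.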

re.findall(r'[\u0080-\uffef]', strin)

which uses a character class covering the non-ASCII characters from U+0080 through U+FFEF in the Unicode Basic Multilingual Plane (including surrogates, which properly speaking we should exclude, but let's not go there for a beginner question). Note also that there are Unicode characters outside the BMP, which can be matched with [\U00010000-\U0010FFFF].
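Here is a quick sketch of both points, using the string from the question. The surrogate-free range and the pattern names are my own choices for illustration, not something the question requires:

```python
import re
import unicodedata

strin = "Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"

# The escape is resolved when the literal is parsed, leaving one character.
nbsp = "\u00a0"
print(len(nbsp), unicodedata.name(nbsp))   # 1 NO-BREAK SPACE

# Non-ASCII BMP characters, skipping the surrogate block U+D800..U+DFFF.
bmp_non_ascii = re.compile(r'[\u0080-\ud7ff\ue000-\uffff]')
# Characters outside the BMP need the eight-digit \U form.
astral = re.compile(r'[\U00010000-\U0010FFFF]')

hits = bmp_non_ascii.findall(strin)
print(len(hits))

astral_hits = astral.findall("snake \U0001F40D")
print(len(astral_hits))
```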

(Tangentially, notice also that the character class [a-z|0-9] includes the literal character |. The | stands for alternation outside character classes, but inside [...] nearly everything is a literal character; the main exceptions are an initial ^, a - between two characters, the closing ], and the backslash.)
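A tiny check of the | behaviour, with a made-up sample string:

```python
import re

# Inside [...] the | is an ordinary character, so this class also matches "|".
with_pipe = re.compile(r'[a-z|0-9]')
without_pipe = re.compile(r'[a-z0-9]')

sample = "a|1"
print(with_pipe.findall(sample))     # ['a', '|', '1']
print(without_pipe.findall(sample))  # ['a', '1']
```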

But more fundamentally, the beginner reaction to "I don't understand this Unicode stuff" is wrong; the response should be "I need to understand this stuff", not "I need to remove it". There is rarely a good case for simply removing all Unicode, and the tendency is only dragging you back into the dark ages before Unicode when you could only represent English text (and barely that) in Western computers.

A more principled solution to this specific problem is to canonicalize all whitespace characters (perhaps except tabs) to an ASCII space, and figure out how to tackle other Unicode characters as you bump into them. What makes sense depends hugely on your specific application. For search or NLP, it might make sense to canonicalize or "flatten" all text to a near-ASCII subset, but for many other applications, you usually need something a bit more nuanced.
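As a rough sketch of that whitespace canonicalization, assuming you want to leave tabs and newlines alone (ODD_SPACE and normalize_spaces are hypothetical names, and the sample string is invented):

```python
import re

# Any whitespace character except tab and newline:
# "not (a non-whitespace character, a tab, or a newline)".
ODD_SPACE = re.compile(r'[^\S\t\n]')

def normalize_spaces(text: str) -> str:
    """Replace each non-tab, non-newline whitespace character with an ASCII space."""
    return ODD_SPACE.sub(' ', text)

# U+00A0 is a no-break space, U+2003 is an em space.
sample = "128GB\u00a0(PRODUCT)\tRed!\u2003Perfect"
normalized = normalize_spaces(sample)
print(repr(normalized))
```

The negated class is just one way to say "whitespace minus a few exceptions"; adjust the exceptions to whatever your application needs to preserve.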

With that out of the way, try

Python 3.8.2 (default, May 18 2021, 11:47:11) 
[Clang 12.0.5 (clang-1205.0.22.9)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> strin = "Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"
>>> import re
>>> re.sub(r'\s', ' ', strin)
'Apple iPhone 12 mini A2176 128GB (PRODUCT) Red! Perfect condition! Unlocked!'

CodePudding user response：

If your purpose is just to remove non-ASCII characters from your text, then you are working way too hard. You can do it simply with

strin.encode('ascii', 'ignore').decode('ascii')

You encode your string as ASCII, ignoring the errors, and then decode it again as ASCII, which removes all the non-ASCII characters.
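One thing to be aware of before using this: the characters are dropped, not replaced with a space, so the result is not quite the output asked for in the question. A short sketch (the second sample string is made up):

```python
strin = "Apple iPhone 12 mini A2176 128GB\u00a0(PRODUCT) Red!\u00a0Perfect condition! Unlocked!"

# errors='ignore' silently drops every character ASCII cannot represent,
# so the words on either side of each no-break space end up joined.
ascii_only = strin.encode('ascii', 'ignore').decode('ascii')
print(ascii_only)

# Accented letters vanish as well, which may hurt keyword matching.
accented = "Caf\u00e9 cr\u00e8me"
stripped = accented.encode('ascii', 'ignore').decode('ascii')
print(stripped)
```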