Python: Finding unicode characters within string using regex

Time:12-01

I am trying to filter out Unicode characters from a JSON object (converted to a string) using a regex in Python, but I can't get the re.compile() call right; it keeps raising errors.

Here is the code:

    regex = re.compile("\u....")
    string = json.dumps(json)
    matches = re.findall(regex, string)
    print(matches)

This is producing this error:

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape

I have tried rewriting it as (r"\u...."), ("\u....") and (r"\u...."), but none of these worked; they give me this error:

re.error: incomplete escape \u at position 0

What is the correct way to get a regex of unicode characters to search a string? Thank you.

CodePudding user response:

In Python, escape sequences are indicated by a backslash (\).

You must escape backslashes themselves if you wish to use them in your string.

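Adding one "\" before the backslash gets the literal past the Python parser, but the regex engine then receives \u.... and rejects it as an incomplete escape. To match a literal backslash followed by "u", the backslash has to be escaped for the regex engine as well, which means four backslashes in a normal string literal:

    regex = re.compile("\\\\u....")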
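The two layers of escaping (first the Python string literal, then the regex engine) can be checked directly. This is only a minimal sketch; the helper name `compiles` is made up for illustration and is not part of the `re` module:

```python
import re

def compiles(pattern):
    """Return True if the regex engine accepts the pattern string."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Two backslashes in a normal literal produce ONE backslash in the string,
# so the regex engine sees \u.... and expects four hex digits after \u.
single = "\\u...."

# Four backslashes produce TWO backslashes in the string, which the regex
# engine reads as "one literal backslash", followed by u and any 4 characters.
double = "\\\\u...."

print(single == r"\u....", compiles(single))
print(double == r"\\u....", compiles(double))
```

Under these assumptions, the first pattern is rejected by the regex engine while the second compiles and matches text such as `\u9a6c`.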

CodePudding user response:

Both the Python parser and the regex engine process backslash escapes, so you need four backslashes in a normal string literal, or two in a raw string. It's also better to be explicit about the form of the Unicode escape; as the example below shows, a loose pattern such as \\u.... can capture more than intended:

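    import json
    import re

    s = r'Test: c:\user\马克'
    regex1 = re.compile(r'\\u....')              # raw string, loose: u plus any 4 characters
    regex2 = re.compile(r'\\u[0-9a-fA-F]{4}')    # raw string, u plus exactly 4 hex digits
    regex3 = re.compile('\\\\u[0-9a-fA-F]{4}')   # same pattern as regex2, normal string literal
    j = json.dumps(s)
    print(j)
    print(regex1.findall(j))
    print(regex2.findall(j))
    print(regex3.findall(j))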

Output:

    "Test: c:\\user\\\u9a6c\u514b"
    ['\\user\\', '\\u9a6c', '\\u514b']  # oops
    ['\\u9a6c', '\\u514b']
    ['\\u9a6c', '\\u514b']
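
If the goal is to find or remove the non-ASCII characters themselves rather than their \uXXXX escapes, one alternative (not part of the answer above) is to have json.dumps keep them unescaped with ensure_ascii=False and match on a character range instead. A rough sketch, using a made-up sample dict:

```python
import json
import re

# Hypothetical sample data, not taken from the original question.
data = {"path": "c:\\user\\马克", "id": 7}

# ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX escapes.
text = json.dumps(data, ensure_ascii=False)

# Matches any single character outside the 7-bit ASCII range.
non_ascii = re.compile(r'[^\x00-\x7F]')

found = non_ascii.findall(text)      # the non-ASCII characters themselves
stripped = non_ascii.sub('', text)   # the same JSON text with them removed

print(found)
print(stripped)
```

This sidesteps the backslash question entirely, because a literal `\u` sequence in the data (as in `c:\user`) is never confused with an escape. Whether stripping characters is appropriate depends on what the JSON is used for afterwards.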