I've got strings that look something like this:
a = "testing test<U 00FA>ing <U 00F3>"
The format will not always be like that, but those Unicode code points in angle brackets will be scattered throughout the text. I want to turn them into the actual Unicode characters they represent. I tried this function:
def replace_unicode(s):
    uni = re.findall(r'<U\ \w\w\w\w>', s)
    for a in uni:
        s = s.replace(a, f'\u{a[3:7]}')
    return s
This successfully finds all of the <U > unicode strings, but it won't let me put them together to create a unicode escape in this manner.
  File "D:/Programming/tests/test.py", line 8
    s = s.replace(a, f'\u{a[3:7]}')
                     ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
How can I create a unicode escape character using an f-string, or via some other method with the information I'm getting from strings?
CodePudding user response:
chepner's answer is good, but you don't actually need an f-string. int(a[3:7], base=16) works perfectly fine.
Also, it would make a lot more sense to use re.sub() instead of re.findall() followed by str.replace(). I would also restrict the regex to just hex digits and capture them in a group.
import re

def replace_unicode(s):
    pattern = re.compile(r'<U\ ([0-9A-F]{4})>')
    return pattern.sub(lambda match: chr(int(match.group(1), base=16)), s)

a = "testing test<U 00FA>ing <U 00F3>"
print(replace_unicode(a))  # -> testing testúing ó
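If the markers can vary a bit more than in the sample, the same re.sub() idea can be loosened. The sketch below is only an illustration under assumptions that are not in the question: that the hex digits may be lowercase, and that a marker may hold 4 to 6 digits (for code points above U+FFFF). The names PATTERN and replace_unicode_lenient are made up for this example.

```python
import re

# Assumption (not from the question): hex digits may be upper- or lowercase,
# and a marker may contain 4 to 6 of them, e.g. <U 1F600>.
PATTERN = re.compile(r'<U ([0-9A-Fa-f]{4,6})>')

def replace_unicode_lenient(s):
    def to_char(match):
        code_point = int(match.group(1), 16)
        # chr() only accepts values up to 0x10FFFF; leave anything larger as-is.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)
    return PATTERN.sub(to_char, s)

print(replace_unicode_lenient("testing test<U 00fa>ing <U 00F3>"))
```

Markers that do not match the pattern, or whose value is out of range, are left in the string unchanged rather than raising an error.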
CodePudding user response:
You can use an f-string to create an appropriate argument to int, whose result the chr function can use to produce the desired character.
for a in uni:
    s = s.replace(a, chr(int(f'0x{a[3:7]}', base=16)))
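For reference, here is a sketch of the whole function from the question with that loop body dropped in. The `import re` line and the print call are added so it runs on its own. The original f'\u{a[3:7]}' fails because a \u escape is resolved when the string literal is parsed, before a[3:7] has any value, so the escape is seen as truncated.

```python
import re

def replace_unicode(s):
    # Same pattern as in the question: "<U ", four word characters, ">".
    uni = re.findall(r'<U\ \w\w\w\w>', s)
    for a in uni:
        # a[3:7] is the four-character hex code; int() accepts the 0x prefix
        # when base=16, and chr() turns the number into the character.
        s = s.replace(a, chr(int(f'0x{a[3:7]}', base=16)))
    return s

a = "testing test<U 00FA>ing <U 00F3>"
print(replace_unicode(a))
```

Note that \w also matches non-hex characters, so a marker such as <U zzzz> would make int() raise a ValueError with this pattern.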