A string variable sometimes contains octal escape sequences that need to be decoded. For example, given oct_var = "String\302\240with\302\240octals", the value of oct_var should become "String with octals", with non-breaking spaces.
The codecs module doesn't seem to support octal, and I failed to find a working solution with encode(). The strings originate upstream, outside my control.
Python 3.9.8
Edited to add: It doesn't have to scale or be ultra fast, so maybe the idea from here (#6) can work (not tested yet):
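import re

def decode(encoded):
    # Replace each literal \ooo escape with the character for that byte value
    for octc in set(re.findall(r'\\([0-7]{3})', encoded)):
        encoded = encoded.replace('\\' + octc, chr(int(octc, 8)))
    # str has no decode() in Python 3: go back to bytes via latin-1 first
    return encoded.encode('latin-1').decode('utf8')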
CodePudding user response:
You forgot to indicate that oct_var should be given as bytes:
>>> oct_var = b"String\302\240with\302\240octals"
>>> oct_var.decode()
'String\xa0with\xa0octals'
>>> print(oct_var.decode())
String with octals
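If the value has already arrived as a str rather than bytes, the same idea still applies. In a normal (non-raw) string literal, Python itself interprets \302\240 as the two characters U+00C2 and U+00A0, so the string holds the UTF-8 bytes disguised as characters. A Latin-1 round trip turns them back into bytes. This is only a minimal sketch, assuming every character in the string is below U+0100:

```python
# The str literal from the question: "\302\240" becomes U+00C2 U+00A0
oct_var = "String\302\240with\302\240octals"

# Latin-1 maps code points 0-255 one-to-one onto bytes, so this restores
# the original UTF-8 byte sequence, which can then be decoded properly.
fixed = oct_var.encode("latin-1").decode("utf-8")

print(repr(fixed))  # 'String\xa0with\xa0octals'
```

If the string contains characters outside Latin-1, encode() raises UnicodeEncodeError, so this only fits data that is known to be mis-decoded bytes.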
CodePudding user response:
Putting your ideas and pointers together, and accepting the risks that come with using an undocumented function[*], namely codecs.escape_decode, this line works:
value = (codecs.escape_decode(bytes(oct_var, "latin-1"))[0].decode("utf-8"))
[*] "Internal function means: you can use it on your risk but the function can be changed or even removed in any Python release."
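For illustration, here is the one-liner wrapped in a small helper and applied to both forms the input might take. The name unoctal is made up for this sketch, and the usual caveat applies: codecs.escape_decode is a CPython internal, so this may stop working in a future release.

```python
import codecs

def unoctal(s):
    # Hypothetical helper name. codecs.escape_decode is undocumented; it
    # returns a (bytes, length_consumed) tuple, hence the [0].
    return codecs.escape_decode(bytes(s, "latin-1"))[0].decode("utf-8")

# Literal backslashes, as they might arrive from an upstream system
literal = r"String\302\240with\302\240octals"
# Escapes already interpreted by the Python parser
interpreted = "String\302\240with\302\240octals"

print(repr(unoctal(literal)))      # 'String\xa0with\xa0octals'
print(repr(unoctal(interpreted)))  # 'String\xa0with\xa0octals'
```

Both inputs give the same result because escape_decode only rewrites backslash escapes and passes all other bytes through unchanged.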
Explanations for codecs.escape_decode:
https://stackoverflow.com/a/37059682/5309571
Examples for its use:
https://www.programcreek.com/python/example/8498/codecs.escape_decode
Other approaches that may turn out to be more future-proof than codecs.escape_decode (no warranty, I have not tried them):
https://stackoverflow.com/a/58829514/5309571
https://bytes.com/topic/python/answers/743965-converting-octal-escaped-utf-8-a
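One way to avoid the internal function altogether is a regex substitution on bytes, using only documented re and str/bytes methods. This is a rough sketch of my own and not necessarily what the linked posts do. The name unescape_octal is made up, and it assumes the input contains literal backslash escapes (e.g. from a raw string or a file) and does not try to handle an escaped backslash in front of digits:

```python
import re

# Backslash followed by an octal value that fits in one byte (000-377)
_OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")

def unescape_octal(text):
    # Hypothetical helper name.
    # Encode first so any real non-ASCII characters survive as UTF-8 bytes
    raw = text.encode("utf-8")
    # Swap each \ooo escape for the single byte it names
    raw = _OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode("utf-8")

print(repr(unescape_octal(r"String\302\240with\302\240octals")))
```

Working on bytes rather than str means the decoded escapes and any genuine non-ASCII text end up in the same UTF-8 byte sequence before the final decode(). If the escapes do not form valid UTF-8, decode() raises UnicodeDecodeError.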