Decode individual octal characters in string variable

Time:12-23

A string variable sometimes contains octal escape sequences that need to be decoded. Example: oct_var = "String\302\240with\302\240octals"; the value of oct_var should be "String with octals", with non-breaking spaces in place of the octal pairs.

The codecs module has no octal codec, and I could not find a working solution with encode(). The strings originate upstream, outside my control.

Python 3.9.8

Edited to add: It doesn't have to scale or be ultra fast, so maybe the idea from here (#6) can work (not tested yet):

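import re

def decode(encoded):
    # Turn each literal \ooo escape (byte range only) into the character with that value.
    raw = re.sub(r'\\([0-3][0-7]{2})', lambda m: chr(int(m.group(1), 8)), encoded)
    # str has no decode() in Python 3: map code points to bytes via latin-1, then decode as UTF-8.
    return raw.encode('latin-1').decode('utf8')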

CodePudding user response:

You forgot to indicate that oct_var should be given as bytes:

>>> oct_var = b"String\302\240with\302\240octals"
>>> oct_var.decode()
'String\xa0with\xa0octals'
>>> print(oct_var.decode())
String with octals
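
If the value can only arrive as a str, not as bytes, the same result may still be reachable. This is a minimal sketch, assuming the escapes were already interpreted upstream, so the str holds the code points U+00C2 and U+00A0 instead of the bytes 0xC2 0xA0. The names raw and fixed are only illustrative:

```python
# Assumption: the escapes were interpreted when the str was created, so each
# octal pair became two separate characters (U+00C2, U+00A0).
oct_var = "String\302\240with\302\240octals"

# latin-1 maps code points 0-255 one-to-one onto byte values,
# which recovers the original UTF-8 byte sequence.
raw = oct_var.encode("latin-1")

# Decoding those bytes as UTF-8 yields a single non-breaking space per pair.
fixed = raw.decode("utf-8")
print(repr(fixed))
```

This raises UnicodeEncodeError if the str contains any character above U+00FF, so it only fits input that really is byte data in disguise.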

CodePudding user response:

Putting your ideas and pointers together, and accepting the risks that come with using an undocumented function[*], namely codecs.escape_decode, this line works:

value = (codecs.escape_decode(bytes(oct_var, "latin-1"))[0].decode("utf-8"))

[*] "Internal function means: you can use it on your risk but the function can be changed or even removed in any Python release."
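
To make clear which inputs the one-liner covers, here is a small sketch that wraps it in a helper and runs it on two variants of the string. The helper name decode_octal_escapes is hypothetical, and codecs.escape_decode remains undocumented, so its behaviour is assumed to match what current CPython releases do:

```python
import codecs

def decode_octal_escapes(text):
    # Hypothetical helper wrapping the one-liner above.
    # latin-1 turns every code point 0-255 into the byte with the same value.
    raw = bytes(text, "latin-1")
    # escape_decode (undocumented) returns a (bytes, length_consumed) tuple.
    unescaped, _consumed = codecs.escape_decode(raw)
    # The resulting bytes are assumed to be valid UTF-8.
    return unescaped.decode("utf-8")

# Variant 1: the backslashes are literal characters in the string.
literal = r"String\302\240with\302\240octals"
# Variant 2: Python already interpreted the escapes in the literal.
interpreted = "String\302\240with\302\240octals"

print(repr(decode_octal_escapes(literal)))
print(repr(decode_octal_escapes(interpreted)))
```

Both variants should come out the same, because escape_decode leaves bytes without backslashes untouched.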

Explanation of codecs.escape_decode:

https://stackoverflow.com/a/37059682/5309571

Examples for its use:

https://www.programcreek.com/python/example/8498/codecs.escape_decode

Other approaches that may turn out to be more future-proof than codecs.escape_decode (no warranty, I have not tried them):

https://stackoverflow.com/a/58829514/5309571

https://bytes.com/topic/python/answers/743965-converting-octal-escaped-utf-8-a
