Why does this production code work: `base64.b64decode(api_token.encode("utf-8")).decode("utf-8")`?

Time:04-01

Today at work, I saw the following line of code:

decoded_token = base64.b64decode(api_token.encode("utf-8")).decode("utf-8")

It is part of an Airflow ETL script, and decoded_token is used as a Bearer token in an API request. The code runs on a server that uses Python 2.7, and my coworker told me that it runs successfully every day.

Yet, from my understanding, the code first tries to turn api_token into bytes (.encode), then turn the bytes into a string (base64.b64decode) and finally turn the string again into a string (.decode). I would think that this always leads to an error.

import base64
api_token = "random-string"
decoded_token = base64.b64decode(api_token.encode("utf-8")).decode("utf-8")

Running the code locally gives me:

Error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xad in position 0: invalid start byte

What input/type would api_token need to be in order for this line not to throw an error? Is that even possible or must there be something else at play?

Edit: As mentioned by Klaus D., apparently, in Python 2 both encode and decode consumed and returned a string. Yet, running the code above in Python 2.7 gives me the same error and I have yet to find an input for api_token that does not throw an error.

CodePudding user response:

The issue is most likely that your test input is not a base64-encoded string, while whatever input is used in production already is!

Python 2.7.18 (default, Jan  4 2022, 17:47:56)
...
>>> import base64
>>> api_token = "random-string"
>>> base64.b64decode(api_token)
'\xad\xa9\xdd\xa2k-\xae)\xe0'
>>> base64.b64decode(api_token).decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xad in position 0: invalid start byte
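
As a side note on why b64decode itself did not complain about "random-string": by default it drops characters outside the base64 alphabet (here the "-") before decoding, so the failure only shows up at the .decode("utf-8") step. The sketch below illustrates this under the assumption that you run it on Python 3, where b64decode takes bytes and has a validate parameter (Python 2.7 does not have it); the variable names are only for illustration.

```python
import base64
import binascii

raw = b"random-string"

# Default (lenient) mode: "-" is not in the base64 alphabet, so it is
# discarded and the remaining 12 characters decode to 9 arbitrary bytes.
lenient = base64.b64decode(raw)
same_without_dash = base64.b64decode(b"randomstring")

# Strict mode (Python 3 only): non-alphabet characters raise binascii.Error.
try:
    base64.b64decode(raw, validate=True)
    strict_rejected = False
except binascii.Error:
    strict_rejected = True

print(lenient == same_without_dash, strict_rejected)
```

The bytes in lenient are the same ones shown in the Python 2.7 session above, and they are not valid UTF-8, which is where the UnicodeDecodeError comes from.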

If you first encode the string as base64, decoding it works. In Python 2 you also don't need to decode the result as "utf-8" afterwards, though you may want to if you expect non-ASCII characters and need a unicode object.

>>> api_token = base64.b64encode(api_token)
>>> api_token
'cmFuZG9tLXN0cmluZw=='
>>> base64.b64decode(api_token)
'random-string'
>>> base64.b64decode(api_token).decode("utf-8")
u'random-string'
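
If you want to check candidate inputs quickly, something like the following could help. It is only a sketch, written for Python 3 (where a failed base64 decode raises binascii.Error), and the helper name survives_production_line is made up for this example; it is not part of the original script.

```python
import base64
import binascii


def survives_production_line(api_token):
    """Return True if the production expression runs without raising."""
    try:
        base64.b64decode(api_token.encode("utf-8")).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        # binascii.Error: malformed base64 (for example, bad padding)
        # UnicodeDecodeError: valid base64, but the bytes are not UTF-8
        return False
    return True


print(survives_production_line("cmFuZG9tLXN0cmluZw=="))  # base64 of "random-string"
print(survives_production_line("random-string"))         # plain text, not base64
```

Any token that was produced by base64-encoding UTF-8 text should pass this check, which is presumably what the production token looks like.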

Example with non-ascii characters

>>> base64.b64decode(base64.b64encode("random string后缀"))
'random string\xe5\x90\x8e\xe7\xbc\x80'
>>> base64.b64decode(base64.b64encode("random string后缀")).decode("utf-8")
u'random string\u540e\u7f00'
>>> import sys
>>> sys.stdout.write(base64.b64decode(base64.b64encode("random string后缀")) + "\n")
random string后缀

Note that in Python 2.7, bytes is just an alias for str, and a separate unicode type was added to hold Unicode text!

>>> bytes is str
True
>>> bytes is unicode
False
>>> str("foo")
'foo'
>>> unicode("foo")
u'foo'
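
For comparison, here is a sketch of how the types line up if the same expression is run on Python 3, where str and bytes are distinct. The intermediate variable names are just for illustration, and the token value is the base64 form of "random-string" from the session above. Note that b64decode returns bytes, not a string, so the final .decode("utf-8") is what produces text.

```python
import base64

api_token = "cmFuZG9tLXN0cmluZw=="             # str (text)
as_bytes = api_token.encode("utf-8")            # str   -> bytes
decoded_bytes = base64.b64decode(as_bytes)      # bytes -> bytes
decoded_token = decoded_bytes.decode("utf-8")   # bytes -> str

for value in (api_token, as_bytes, decoded_bytes, decoded_token):
    print(type(value).__name__, repr(value))
```

On Python 3 the initial .encode("utf-8") is optional, since b64decode also accepts an ASCII-only str, but keeping it is harmless.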