How to decode text with special symbols using base64 in Python 3?

I am trying to decode a list of strings using the base64 module. I can decode some of them, but not the ones that contain special symbols.

import base64

# List of strings we are trying to decode
encoded_text_list = ['MTA0MDI0', 'MTA0MDYw', 'MTA0MDgz', 'MTA0MzI%3D']

# Iterate over the list and decode each string with base64
for k in encoded_text_list:
    print(k, base64.b64decode(k).decode())

Output:

MTA0MDI0 104024
MTA0MDYw 104060
MTA0MDgz 104083

---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
<ipython-input-60-d1ba00f4e54a> in <module>
      2 for k in member_url_list:
      3     print(k)
----> 4     print(base64.b64decode(k).decode())
      5     # break

/usr/lib/python3.6/base64.py in b64decode(s, altchars, validate)
     85     if validate and not re.match(b'^[A-Za-z0-9+/]*={0,2}$', s):
     86         raise binascii.Error('Non-base64 digit found')
---> 87     return binascii.a2b_base64(s)
     88 
     89 

Error: Incorrect padding

The script works until it reaches the string 'MTA0MzI%3D', where it raises the above error.

Since the strings in the list come from a URL, I also tried the unquote function from urllib.parse.

import base64
from urllib.parse import unquote

b64_string = 'MTA0MzI%3D'
b64_string = unquote(b64_string)  # 'MTA0MzI='
# Pad with '=' until the length is a multiple of 4
b64_string += "=" * ((4 - len(b64_string) % 4) % 4)
print(base64.b64decode(b64_string).decode())

Output:

10432

Expected Output:

104327

The output may seem correct, but it is not what I expect: unquote converts the input from 'MTA0MzI%3D' to 'MTA0MzI=', and the decoded output changes from '104327' to '10432'. The original text, symbol included, decodes to 104327 on this base64 site.

I have tried different versions of Python (2, 3.6, 3.8, etc.), the codecs module, and several other base64 functions, all without success. Can someone please help me get this working or suggest another way to do it?

CodePudding user response:

The problem is that % is not a valid base64 character. The decoder expects an = sign there and instead finds %3D, which is the URL encoding of =. This likely means the value is URL-encoded somewhere upstream of your code. Depending on your requirements, you have some options:

Call k = urllib.parse.unquote(k); see the standard-library urllib.parse module
Call k = k.replace('%3D', '=') to clean up this error
Change the inputs so they are not URL-encoded
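
A minimal sketch of the first two options, assuming the input is the URL-encoded string from the question. The variable names are illustrative only.

```python
import base64
from urllib.parse import unquote

raw = 'MTA0MzI%3D'  # URL-encoded form of 'MTA0MzI='

# Option 1: undo the URL encoding in general.
via_unquote = unquote(raw)

# Option 2: repair only the encoded padding character.
via_replace = raw.replace('%3D', '=')

print(via_unquote, via_replace)
print(base64.b64decode(via_unquote).decode())
```

Option 1 is the more general of the two: unquote also handles the lowercase form %3d and any other percent-encoded characters, while the replace call only fixes the exact sequence it is given.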

CodePudding user response:

These are URL-quoted strings, so URL-unquoting is the correct procedure. The first step is to unquote them with urllib.parse.unquote. Only after that should you attempt base64 decoding, and there is no need to manually adjust the base64 padding character =.

The website you reference ignores invalid base64 characters and also infers the padding from the length of the base64-encoded data. So you give the website MTA0MzI%3D, it throws away the % because it is not a valid base64 character, then processes MTA0MzI3D and returns 104327. Base64 padding is redundant, and I'm not sure why some base64 encoding standards require it, but many do.
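
The behaviour described above can be imitated with a short sketch. The helper lenient_b64decode is hypothetical and only approximates what such a site might do; it is not the website's actual code.

```python
import base64
import re


def lenient_b64decode(text):
    """Hypothetical lenient decoder: ignore invalid characters, infer padding."""
    # Drop everything outside the standard base64 alphabet,
    # which removes '%' as well as any '=' padding.
    cleaned = re.sub(r'[^A-Za-z0-9+/]', '', text)
    # A single leftover character carries only 6 bits and cannot
    # form a byte, so it is discarded.
    if len(cleaned) % 4 == 1:
        cleaned = cleaned[:-1]
    # Infer the padding from the remaining length.
    cleaned += '=' * (-len(cleaned) % 4)
    return base64.b64decode(cleaned)


print(lenient_b64decode('MTA0MzI%3D'))  # '%' dropped, '3D' treated as data
print(lenient_b64decode('MTA0MzI='))    # the properly unquoted string
```

With the URL-quoted input, the characters 3 and D left over from %3D are treated as base64 data, which is where the extra 7 in 104327 comes from.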

Example:

import base64
import urllib.parse

# List of string which we are trying to decode
encoded_text_list = ['MTA0MDI0', 'MTA0MDYw', 'MTA0MDgz', 'MTA0MzI%3D']

# Iterating and decoding string using base64
for k in encoded_text_list:
    url_unquoted = urllib.parse.unquote(k)
    print(k, base64.b64decode(url_unquoted).decode('utf-8'))

Output

MTA0MDI0 104024
MTA0MDYw 104060
MTA0MDgz 104083
MTA0MzI%3D 10432

and 10432 is the correct output, not 104327.
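
One way to check this is to run the encoding in the forward direction. This is a sketch assuming the upstream system base64-encoded the text and then URL-quoted it, which the question does not confirm.

```python
import base64
from urllib.parse import quote, unquote

# Forward direction: base64-encode, then URL-quote.
encoded = base64.b64encode(b'10432').decode('ascii')
url_quoted = quote(encoded, safe='')
print(encoded, url_quoted)

# Reverse direction: URL-unquote, then base64-decode.
print(base64.b64decode(unquote(url_quoted)).decode('utf-8'))
```

Starting from 10432 reproduces exactly the string in the question, which supports reading %3D as a quoted padding character and not as data.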