I am trying to decode a list of strings using the base64 module. I can decode some of them, but not the ones that contain special symbols.
import base64
# List of strings we are trying to decode
encoded_text_list = ['MTA0MDI0', 'MTA0MDYw', 'MTA0MDgz', 'MTA0MzI%3D']
# Iterate over the list and decode each string with base64
for k in encoded_text_list:
    print(k, base64.b64decode(k).decode())
Output:
MTA0MDI0 104024
MTA0MDYw 104060
MTA0MDgz 104083
---------------------------------------------------------------------------
Error Traceback (most recent call last)
<ipython-input-60-d1ba00f4e54a> in <module>
2 for k in member_url_list:
3 print(k)
----> 4 print(base64.b64decode(k).decode())
5 # break
/usr/lib/python3.6/base64.py in b64decode(s, altchars, validate)
85 if validate and not re.match(b'^[A-Za-z0-9+/]*={0,2}$', s):
86 raise binascii.Error('Non-base64 digit found')
---> 87 return binascii.a2b_base64(s)
88
89
Error: Incorrect padding
The script works until it reaches the string 'MTA0MzI%3D', where it raises the error above.
Since the strings in the list come from a URL, I also tried the unquote function from urllib.parse.
from urllib.parse import unquote
b64_string = 'MTA0MzI%3D'
b64_string = unquote(b64_string) # 'MTA0MzI='
b64_string += "=" * ((4 - len(b64_string) % 4) % 4)
print(base64.b64decode(b64_string).decode())
Output:
10432
Expected Output:
104327
Now the output may seem correct, but it isn't: unquoting converts the input text from 'MTA0MzI%3D' to 'MTA0MzI=', and so the output changes from '104327' to '10432'. The thing is, the original text with the symbol decodes correctly on this base64 site.
I have tried different versions of Python (2, 3.6, 3.8, etc.). I have also tried the codecs module and explored some other base64 functions, but without success. Can someone please help me get this working, or suggest another way to do it?
CodePudding user response:
The problem is that % is not a valid base64 character. The decoder expects an = sign there and instead finds %3D, which happens to be the URL encoding of =. This likely means the value is URL-encoded somewhere upstream from your code. Depending on requirements, you have some options:
- Call k = urllib.parse.unquote(k) to URL-decode the value; see the standard urllib.parse module
- Call k = k.replace('%3D', '=') to clean up this specific error
- Change the inputs so they are not URL-encoded
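To see which characters are tripping up the decoder, a small diagnostic like the one below can help. It is only a sketch: the helper name invalid_base64_chars is made up for illustration, and it assumes the standard base64 alphabet (not the URL-safe variant that uses - and _).

```python
import string
from urllib.parse import unquote

# Characters allowed in standard base64, including the '=' padding character.
BASE64_CHARS = set(string.ascii_letters + string.digits + "+/=")


def invalid_base64_chars(text):
    """Return the sorted characters in text that standard base64 does not allow."""
    return sorted(set(text) - BASE64_CHARS)


raw = "MTA0MzI%3D"
print(invalid_base64_chars(raw))           # ['%']
print(invalid_base64_chars(unquote(raw)))  # []
```

An empty list after unquoting suggests the value was URL-encoded and is now safe to hand to base64.b64decode.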
CodePudding user response:
These are URL-quoted strings, so URL-unquoting is the correct procedure. The first step is to unquote them with urllib.parse.unquote. Only after that should you attempt base64 decoding, and there's no need to manually mess around with the base64 padding character =.
The website you reference ignores invalid base64 characters and also infers the padding from the length of the base64-encoded data. So you give the website MTA0MzI%3D, it throws away the % because it's not a valid base64 character, then processes MTA0MzI3D and returns 104327. Base64 padding is redundant, and I'm not sure why some base64 encoding standards require it, but many do.
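Here is a rough sketch of that lenient behavior, next to the unquote-first approach. This is my guess at what the site does, not its actual code, and the helper name lenient_b64decode is made up for illustration.

```python
import base64
import re
from urllib.parse import unquote


def lenient_b64decode(text):
    """Decode the way a forgiving decoder might: drop junk, then infer padding."""
    # Drop every character outside the standard base64 alphabet ('%' and '=' included).
    cleaned = re.sub(r"[^A-Za-z0-9+/]", "", text)
    # A single leftover character cannot encode a whole byte, so discard it.
    if len(cleaned) % 4 == 1:
        cleaned = cleaned[:-1]
    # Infer the padding from the length instead of trusting the input.
    cleaned += "=" * (-len(cleaned) % 4)
    return base64.b64decode(cleaned)


raw = "MTA0MzI%3D"
print(lenient_b64decode(raw).decode())          # 104327 (the stray '3' is treated as data)
print(base64.b64decode(unquote(raw)).decode())  # 10432
```

The lenient path keeps the 3 from %3D as if it were base64 data, which is where the extra 7 in the site's answer comes from.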
Example:
import base64
import urllib.parse
# List of strings we are trying to decode
encoded_text_list = ['MTA0MDI0', 'MTA0MDYw', 'MTA0MDgz', 'MTA0MzI%3D']
# URL-unquote each string first, then base64-decode it
for k in encoded_text_list:
    url_unquoted = urllib.parse.unquote(k)
    print(k, base64.b64decode(url_unquoted).decode('utf-8'))
Output
MTA0MDI0 104024
MTA0MDYw 104060
MTA0MDgz 104083
MTA0MzI%3D 10432
and 10432 is the correct output, not 104327.
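One way to convince yourself is to go in the other direction: base64-encode both candidate values and URL-quote the result. This is only a sanity check, and the helper name encode_for_url is made up for illustration.

```python
import base64
from urllib.parse import quote


def encode_for_url(value):
    """Base64-encode a string, then URL-quote the result."""
    encoded = base64.b64encode(value.encode("ascii")).decode("ascii")
    return quote(encoded)


print(encode_for_url("10432"))   # MTA0MzI%3D
print(encode_for_url("104327"))  # MTA0MzI3
```

Only 10432 round-trips to the string in your list; 104327 encodes to MTA0MzI3, which needs no padding and contains no %.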