Decoding a fetched email header or subject into a readable form

I receive emails with unique subjects, and I want to save them. I tried this (the step where the credentials are entered is omitted):

import email
import imaplib

# suka is an imaplib connection that is already logged in (credentials omitted)
suka.select('Inbox')
key = 'FROM'
value = 'TBD'
_, data = suka.search(None, key, value)
mail_id_list = data[0].split()
msgs = []
for num in mail_id_list:
    typ, data = suka.fetch(num, '(RFC822)')
    msgs.append(data)
for msg in msgs[::-1]:
    for response_part in msg:
        if isinstance(response_part, tuple):
            my_msg = email.message_from_bytes(response_part[1])
            print("subj:", my_msg['subject'])

            for part in my_msg.walk():
                # print(part.get_content_type())
                if part.get_content_type() == 'text/plain':
                    print(part.get_payload())

I do get the subjects, but in the form "subj: =?utf-8?B?0LfQsNGP0LLQutCwIDIxXzE0MTIyMg==?=", so some decoding is required. The question is which variable needs to be decoded. I also tried another way:

yek, do = suka.uid('fetch', govno,('RFC822'))

Here govno is the latest email in the inbox. The output is "can't concat int to bytes". So, is there a way to decode the subjects so that they appear as they do in the email client? Thank you.

CodePudding user response：

The standard library has a built-in function for this, email.header.decode_header(). From its documentation:

Decode a message header value without converting the character set. The header value is in header.

This function returns a list of (decoded_string, charset) pairs containing each of the decoded parts of the header. charset is None for non-encoded parts of the header, otherwise a lower case string containing the name of the character set specified in the encoded string.
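To illustrate the (decoded_string, charset) pairs described above, here is a small sketch. The "Weekly report" and "Re: ..." subjects are made-up examples, not headers from the question. It shows that a header without any encoded words comes back as a str, while a header containing an encoded word comes back as bytes parts.

```python
from email.header import decode_header

# A header with no encoded words is returned unchanged: a str with charset None.
plain = decode_header("Weekly report")

# A header mixing plain text and an encoded word is returned as bytes parts;
# the plain part has charset None, the encoded part carries its charset.
mixed = decode_header("Re: =?utf-8?B?0LfQsNGP0LLQutCwIDIxXzE0MTIyMg==?=")

for text, charset in plain + mixed:
    print(type(text).__name__, charset)
```

The practical consequence is that code consuming these pairs should check whether a part is str or bytes before calling .decode() on it.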

>>> from email.header import decode_header
>>> decoded_headers = decode_header("=?utf-8?B?0LfQsNGP0LLQutCwIDIxXzE0MTIyMg==?=")
>>> decoded_headers
[(b'\xd0\xb7\xd0\xb0\xd1\x8f\xd0\xb2\xd0\xba\xd0\xb0 21_141222', 'utf-8')]
>>> first = decoded_headers[0]
>>> first[0].decode(first[1])
'заявка 21_141222'

Each byte string returned by decode_header() can then be decoded using the charset returned alongside it.
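As an alternative to decoding each part by hand, the same module offers make_header(), which rebuilds a Header object from the pairs. The sketch below wraps this in a function; the name decode_subject is only an illustrative choice, not something from the question.

```python
from email.header import decode_header, make_header

def decode_subject(raw_subject):
    # make_header() reassembles the (value, charset) pairs into a Header object;
    # str() on that object yields the decoded text.
    return str(make_header(decode_header(raw_subject)))

print(decode_subject("=?utf-8?B?0LfQsNGP0LLQutCwIDIxXzE0MTIyMg==?="))
```

This version does no error handling of its own, so a header with an unknown or wrong charset may still raise an exception.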

For the follow-up question, here's a helper function that returns the decoded header value, copes with multi-part header values, and handles decoding errors:

from email.header import decode_header

def get_header(header_text, default='utf-8'):
    try:
        headers = decode_header(header_text)
    except Exception:
        print('Error while decoding header, using the header without decoding')
        return header_text

    header_sections = []
    for text, charset in headers:
        if isinstance(text, str):
            # decode_header returns str (not bytes) when nothing was encoded
            header_section = text
        else:
            try:
                # if charset does not exist, try to decode with the default
                header_section = text.decode(charset or default)
            except (LookupError, UnicodeDecodeError):
                # unknown charset, or the header is incorrectly encoded:
                # decode with the default and ignore the decoding errors
                header_section = text.decode(default, errors='ignore')
        if header_section:
            header_sections.append(header_section)
    return ' '.join(header_sections)

print(get_header("=?utf-8?B?0LfQsNGP0LLQutCwIDIxXzE0MTIyMg==?="))
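Applied to the loop in the question, the value that needs decoding is my_msg['subject']. Another option, sketched below, is to parse the message with policy.default, in which case the parser returns header values already decoded. The function names are hypothetical, and conn stands for an already logged-in imaplib connection with a mailbox selected. The str(uid) conversion rests on the assumption that the "can't concat int to bytes" error came from passing an int UID, since imaplib expects str or bytes arguments.

```python
import email
from email import policy

def subject_from_bytes(raw_message):
    # With policy.default the parser returns header values as decoded str.
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    subject = msg['subject']
    return None if subject is None else str(subject)

def fetch_subject_by_uid(conn, uid):
    # `conn` is assumed to be a logged-in imaplib connection (hypothetical here).
    # The UID is converted to str because imaplib expects str or bytes arguments.
    typ, data = conn.uid('fetch', str(uid), '(RFC822)')
    for response_part in data:
        if isinstance(response_part, tuple):
            # response_part[1] holds the raw RFC822 message bytes
            return subject_from_bytes(response_part[1])
    return None
```

For example, subject_from_bytes(response_part[1]) could replace the email.message_from_bytes(...) and my_msg['subject'] lines inside the original loop.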