Converting from ANSI to UTF-8 in Python within an AWS Lambda

I am receiving a zip file in an S3 bucket. Its put event triggers an AWS Lambda function, which is supposed to unzip the file and upload the files inside it to another S3 bucket.

But these files can be a mix of ANSI and UTF-8 files.

I have to convert all of these to UTF-8. Any idea on how I can do it?

def get_utf_encoded_file(
    file,
    file_name: str
):
    is_ansi = False
    try:
        file.read().decode('utf-8')
    except:
        try:
            file.read().decode('cp1252')  # I tried to print here, gives empty string
            is_ansi = True
        except Exception as e:
            log.error(f"Unable to parse file {file_name}")
            raise Exception(f"Unable to parse file {file_name}")
            
    if is_ansi:
        byte_stream = None
        temp_file_name = "/tmp/" + str(uuid.uuid4()) + ".txt"
        with codecs.open(temp_file_name, "w", encoding='UTF-8') as temp_file:
            temp_file.write(file.read().decode('cp1252'))
                
        with open(temp_file_name, "rb") as temp_file:
            byte_stream = temp_file.read()  # I tried print here, gives empty byte array
            print(byte_stream)
            
        os.remove(temp_file_name)
        return byte_stream
    else:
        return file

The function that's calling it:

def unzip_to_temp(
    zip: ZipFile
):
    for file_name in zip.namelist():
        file_data = get_utf_encoded_file(zip.open(file_name), file_name)
        upload_to_s3(file_data)

But the ANSI files are created as empty files in S3.

CodePudding user response：

You are calling file.read() multiple times. The first call, made for the UTF-8 check, consumes the whole stream, so the later reads you do for ANSI return empty bytes.

You should call it once and save the result, then do the decoding.

Reference: https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
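To see the effect in isolation, here is a minimal sketch that uses an in-memory io.BytesIO stream as a stand-in for the object returned by zip.open(...) (the stand-in and the sample text are assumptions for illustration, not part of the original code). It shows that a second read() returns nothing and that decoding should work on the saved bytes:

```python
import io

# An in-memory stream stands in for the file object from zip.open(...).
stream = io.BytesIO("café".encode("cp1252"))

first = stream.read()   # consumes the whole stream
second = stream.read()  # the position is already at the end, so this is empty

# Keep the bytes from the single read and try each decoding on that copy.
try:
    text = first.decode("utf-8")
except UnicodeDecodeError:
    text = first.decode("cp1252")

print(repr(second))
print(text)
```

Here the UTF-8 attempt fails on the cp1252-encoded "é", and the fallback decodes the same saved bytes instead of reading the stream again.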

CodePudding user response：

You're attempting to call read from the same file multiple times. After the first read, the pointer will be at the end of the file, so nothing new will be read.

Instead, read the data once and then attempt to decode it. Since the decoding happens in memory, you can skip writing to disk altogether and return an encoded version of the string:

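import logging

# Assumption: `log` in the question is a standard logging logger.
log = logging.getLogger(__name__)


def get_utf_encoded_file(
    file,
    file_name: str
):
    # Read the stream once; a second read() would return empty bytes.
    data = file.read()
    try:
        data.decode('utf-8')
        # data decodes cleanly as utf-8
        return data
    except UnicodeDecodeError:
        pass

    try:
        data = data.decode('cp1252').encode("utf-8")
        # data decodes cleanly as cp1252, is now utf-8
        return data
    except UnicodeDecodeError:
        log.error(f"Unable to parse file {file_name}")
        raise Exception(f"Unable to parse file {file_name}")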
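For the calling side, a rough sketch of how the whole archive could be converted in memory might look like the following. The names to_utf8 and convert_zip_members are hypothetical, and the S3 upload is left out; the sketch returns a dict of converted bytes that a real handler would then pass to its own upload function:

```python
import io
import zipfile


def to_utf8(data: bytes) -> bytes:
    """Return data as UTF-8 bytes, falling back to cp1252 if needed."""
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8 (plain ASCII included)
    except UnicodeDecodeError:
        return data.decode("cp1252").encode("utf-8")


def convert_zip_members(zip_bytes: bytes) -> dict:
    """Map each file name in the archive to its UTF-8 encoded contents."""
    converted = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            # ZipFile.read returns the member's full contents in one call.
            converted[name] = to_utf8(archive.read(name))
    return converted


# Build a small archive in memory to exercise the sketch.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("ansi.txt", "naïve".encode("cp1252"))
    archive.writestr("utf8.txt", "naïve".encode("utf-8"))

result = convert_zip_members(buffer.getvalue())
```

Both entries in result end up holding the same UTF-8 bytes. Trying UTF-8 first matters: almost any byte sequence decodes under cp1252, so checking it first would misread genuine UTF-8 files.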