I am receiving a zip file in an S3 bucket. Its put event triggers an AWS Lambda. My Lambda is supposed to unzip the file and upload the files inside it to another S3 bucket.
But these files can be a mix of ANSI and UTF-8 files.
I have to convert all of them to UTF-8. Any idea how I can do it?
def get_utf_encoded_file(
    file,
    file_name: str
):
    is_ansi = False
    try:
        file.read().decode('utf-8')
    except:
        try:
            file.read().decode('cp1252')  # << I tried to print here, gives empty string
            is_ansi = True
        except Exception as e:
            log.error(f"Unable to parse file {file_name}")
            raise Exception(f"Unable to parse file {file_name}")
    if is_ansi:
        byte_stream = None
        temp_file_name = "/tmp/" + str(uuid.uuid4()) + ".txt"
        with codecs.open(temp_file_name, "w", encoding='UTF-8') as temp_file:
            temp_file.write(file.read().decode('cp1252'))
        with open(temp_file_name, "rb") as temp_file:
            byte_stream = temp_file.read()  # << I tried print here gives empty byte array
            print(byte_stream)
        os.remove(temp_file_name)
        return byte_stream
    else:
        return file
The function that's calling it:
def unzip_to_temp(
    zip: ZipFile
):
    for file_name in zip.namelist():
        # Arguments passed in the order the signature expects: (file, file_name)
        file_data = get_utf_encoded_file(zip.open(file_name), file_name)
        upload_to_s3(file_data)
But the ANSI files are created as empty files in S3.
CodePudding user response:
You are calling file.read() multiple times. The first call, made for the UTF-8 attempt, always runs and consumes the whole stream, so the reads you do afterwards for ANSI return empty strings.
You should call it once and save the result, then do the decoding.
Reference: https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
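A small illustration of that behaviour, using io.BytesIO as a stand-in for the object returned by zip.open() (an assumption made for the demo; the sample text is made up):

```python
import io

# BytesIO stands in for the binary file object returned by zip.open()
stream = io.BytesIO("café".encode("cp1252"))

first = stream.read()   # consumes the whole stream
second = stream.read()  # nothing left: the position is already at the end

# Read once, keep the bytes, and decode the saved value as often as needed
text = first.decode("cp1252")
utf8_bytes = text.encode("utf-8")
```

Here second is empty while first still holds the original bytes, which is why the decode should work on the saved value rather than on a fresh read.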
CodePudding user response:
You're attempting to call read from the same file multiple times. After the first read, the pointer will be at the end of the file, so nothing new will be read.
Instead, you can read the data once, then attempt to decode it. And since you're decoding it in memory, you can skip writing to disk altogether and return an encoded version of the string:
def get_utf_encoded_file(
    file,
    file_name: str
):
    # Read once; a second read() on the same stream would return b""
    data = file.read()
    try:
        data.decode('utf-8')
        # data decodes cleanly as utf-8
        return data
    except UnicodeDecodeError:
        pass
    try:
        data = data.decode('cp1252').encode("utf-8")
        # data decodes cleanly as cp1252, is now utf-8
        return data
    except UnicodeDecodeError:
        log.error(f"Unable to parse file {file_name}")
        raise Exception(f"Unable to parse file {file_name}")
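For context, here is a sketch of how the read-once approach could sit inside the unzip loop. It builds a small archive in memory so it runs on its own. The names to_utf8_bytes and convert_archive are hypothetical, and the returned dict stands in for whatever upload_to_s3 would receive.

```python
import io
from zipfile import ZipFile


def to_utf8_bytes(data: bytes) -> bytes:
    """Return data as UTF-8 bytes, trying UTF-8 first and then cp1252."""
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8
    except UnicodeDecodeError:
        # Not UTF-8, so assume cp1252. This still raises UnicodeDecodeError
        # for the few byte values that cp1252 leaves undefined.
        return data.decode("cp1252").encode("utf-8")


def convert_archive(archive: ZipFile) -> dict:
    """Map each member name to its UTF-8 encoded contents."""
    converted = {}
    for name in archive.namelist():
        with archive.open(name) as member:
            converted[name] = to_utf8_bytes(member.read())  # read exactly once
    return converted


# Demo archive with one UTF-8 member and one cp1252 ("ANSI") member
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("utf8.txt", "naïve".encode("utf-8"))
    archive.writestr("ansi.txt", "naïve".encode("cp1252"))

with ZipFile(buffer) as archive:
    result = convert_archive(archive)
```

Both members end up as the same UTF-8 bytes. One caveat: trying UTF-8 first is a heuristic, since a cp1252 file whose bytes happen to form valid UTF-8 would be passed through unchanged.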