Zip file contents in AWS Lambda


We have a function that gets the list of files in a zip file. It works standalone, and in Lambda until the file is larger than 512 MB. The function needs to get a list of the files in the zip file and read the contents of a JSON file that should be in the zip file.

This is part of the function:

import io
import zipfile

import boto3

s3_client = boto3.client('s3')

try:
    s3_object = s3_client.get_object(Bucket=bucketname, Key=filename)
    #s3_object = s3_client.head_object(Bucket=bucketname, Key=filename)
    #s3_object = s3_resource.Object(bucket_name=bucketname, key=filename)
except s3_client.exceptions.NoSuchKey:
    return 'NotExist'

# This reads the entire object into memory, which is where it fails for large files
zip_file = s3_object['Body'].read()
buffer = io.BytesIO(zip_file)
# buffer = io.BytesIO(s3_object.get()['Body'].read())

with zipfile.ZipFile(buffer, mode='r', allowZip64=True) as zip_files:

    for content_filename in zip_files.namelist():
        zipinfo = zip_files.getinfo(content_filename)
        if not zipinfo.filename.startswith('__'):
            no_files = 1
            if zipinfo.filename == json_file:
                json_exist = True
                with io.TextIOWrapper(zip_files.open(json_file), encoding='utf-8') as jsonfile:
                    object_json = jsonfile.read()
The get_object call is the issue, as it loads the whole file into memory, and obviously the larger the file, the more likely it is to exceed what's available in Lambda. I've tried using head_object, but that only gives me the metadata for the file, and I don't know how to get the list of files in the zip file when using head_object or resource.Object.

I would be grateful for any ideas please.

CodePudding user response:

It would likely be the .read() operation that consumes the memory.

So, one option is to simply increase the memory allocation given to the Lambda function.
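For example, with the AWS CLI (the function name below is a placeholder; Lambda memory can be raised up to 10,240 MB):

aws lambda update-function-configuration --function-name my-function --memory-size 10240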

Or, you can use download_file() to save the zip file to disk, but Lambda functions are only given 512MB in the /tmp/ directory for storage, so you would likely need to mount an Amazon EFS filesystem for additional storage.
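A rough sketch of that approach, reusing the bucketname and filename placeholders from the question and assuming the archive fits in the available /tmp/ (or EFS) storage:

import zipfile

import boto3

s3_client = boto3.client('s3')

# Stream the object to disk instead of holding it in memory;
# /tmp is limited to 512 MB unless extra storage is mounted
local_path = '/tmp/archive.zip'
s3_client.download_file(bucketname, filename, local_path)

with zipfile.ZipFile(local_path, mode='r', allowZip64=True) as zf:
    names = zf.namelist()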

Or you might be able to use the smart-open library (available on PyPI) to read the contents of the zip file directly from S3 -- it knows how to use open() with files in S3 and also zip files.
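A minimal sketch with smart-open, using the same bucketname and filename placeholders (how much it buffers internally depends on smart-open's S3 transport):

import zipfile

from smart_open import open as smart_open  # pip install 'smart_open[s3]'

# smart-open returns a seekable, file-like S3 reader, which is what
# zipfile needs to locate the central directory at the end of the file
with smart_open(f's3://{bucketname}/{filename}', 'rb') as s3_file:
    with zipfile.ZipFile(s3_file, mode='r', allowZip64=True) as zf:
        names = zf.namelist()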

CodePudding user response:

Not a full answer, but too long for comments.

It is possible to download only part of a file from S3, so you should be able to grab only the list of files and parse that.

The zip file format places the list of archived files at the end of the archive, in the central directory (a sequence of central directory file headers).

You can download part of a file from S3 by specifying a range to the GetObject API call. In Python, with boto3, you would pass the range as the Range parameter to the get_object() S3 client method, or to the get() method of the Object resource.
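For example, a suffix range fetches just the last N bytes of the object (here 1 MiB; bucketname and filename are placeholders):

import boto3

s3_client = boto3.client('s3')

# 'bytes=-N' is an HTTP suffix range: the last N bytes of the object
response = s3_client.get_object(
    Bucket=bucketname,
    Key=filename,
    Range='bytes=-1048576',
)
tail = response['Body'].read()  # at most 1 MiB, regardless of object size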

So, you could read pieces from the end of the file in 1 MiB increments until you find the central directory file header signature (0x02014b50), then parse the headers and extract the file names. You might even be able to trick Python into thinking it's a proper .zip file and convince it to give you the list while providing only the last piece(s), which would be an elegant solution that doesn't require downloading huge files.
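A sketch of that trick, with list_zip_names as a hypothetical helper. It relies on CPython's zipfile locating the central directory relative to the end of the buffer (it scans backwards for the end-of-central-directory record), so treat it as an experiment rather than a guarantee:

import io
import zipfile

import boto3

s3_client = boto3.client('s3')

def list_zip_names(bucket, key, chunk=1024 * 1024):
    """Return the archive's file names while downloading only its tail."""
    size = s3_client.head_object(Bucket=bucket, Key=key)['ContentLength']
    while True:
        length = min(chunk, size)
        tail = s3_client.get_object(
            Bucket=bucket, Key=key, Range=f'bytes=-{length}'
        )['Body'].read()
        try:
            # namelist() only needs the central directory, which sits at
            # the end of the archive; extracting members would still fail
            # because their local file headers are not in the buffer
            return zipfile.ZipFile(io.BytesIO(tail)).namelist()
        except zipfile.BadZipFile:
            if length == size:
                raise  # we had the whole file and it still failed
            chunk *= 2  # tail didn't contain the whole central directory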

Or, it might be easier to ask the uploader to provide a list of files with the archive. Depending on your situation, not everything has to be solved in code :).
