I have a Lambda function that reads a gzip file and generates a SHA-512 hash from the content inside the file. The hash value is then compared against an expected hash value to determine whether the file has been tampered with. Previously, the code was written as follows, which caused the function to run indefinitely when it encountered huge files:
import gzip
import hashlib

with gzip.open(csv_object["Body"], 'rb') as f:
    for l in f:
        content = l

print(hashlib.sha512(content).hexdigest())
Note: csv_object["Body"] points to the file in our S3 bucket.
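For context, csv_object is retrieved roughly along these lines (a minimal sketch assuming a plain boto3 get_object call; the bucket and key shown are placeholders):

import boto3

s3 = boto3.client('s3')
# Placeholder bucket/key; the real values come from our event handling
csv_object = s3.get_object(Bucket='my-bucket', Key='data/test_file.csv.gz')
# csv_object["Body"] is a botocore StreamingBody (a file-like stream), not a path on disk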
To solve the issue, I rewrote it another way and tested it first in my local IDE. The code is as follows:
import hashlib
import gzip

sha512 = hashlib.sha512()
with gzip.open('test_file.csv.gz', 'rb') as file:
    sha512.update(file.read())

print(sha512.hexdigest())
From my IDE, the result is printed almost instantly, but when I pasted the code over and ran it in Lambda, it failed. We then found out that the issue is likely caused by passing csv_object["Body"] as the argument to gzip.open(). The error we got from CloudWatch was:
[ERROR] TypeError: expected str, bytes or os.PathLike object, not StreamingBody
Has anyone encountered this?
CodePudding user response:
In your second example you're loading the entire decompressed file into memory with file.read(). That creates a hard limit on the file size you can handle in a Lambda, since by default Lambdas have fairly small memory allocations.
Rather than doing that, you can decompress the object in a streaming fashion and pass that stream to hashlib to calculate the hash, which allows large files to be processed comfortably within a Lambda execution environment:
import gzip
import hashlib
import boto3

s3 = boto3.client('s3')
resp = s3.get_object(Bucket=bucket, Key=key)

# Pass the StreamingBody to gzip to decompress it as a stream
gz = gzip.GzipFile(fileobj=resp['Body'])

# Read the file in small chunks to stream the results into hashlib
sha512 = hashlib.sha512()
while True:
    # Read 8 MiB at a time
    data = gz.read(8388608)
    if len(data) == 0:
        break
    sha512.update(data)

digest = sha512.hexdigest()
# -- or, on Python 3.11+, replace the chunked loop above with --
# Pass the decompressed stream straight to hashlib to hash it as a stream
digest = hashlib.file_digest(gz, 'sha512').hexdigest()
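If you then need to compare the result against the expected hash (as described in the question), hmac.compare_digest gives a constant-time comparison. A minimal sketch, assuming the known-good digest is available as a hex string in expected_digest:

import hmac

# expected_digest is assumed to hold the known-good SHA-512 hex string
if hmac.compare_digest(digest, expected_digest):
    print("File matches the expected hash")
else:
    print("Hash mismatch - the file may have been altered")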