How to stream and read from a .tar.gz in S3 with boto3?


On S3 there is a JSON file with the following format:

{"field1": "...", "field2": "...", ...}
{"field1": "...", "field2": "...", ...}
{"field1": "...", "field2": "...", ...}

It is compressed in .tar.gz format, and its uncompressed size is ~30 GB, so I would like to read it in a streaming fashion.

Using the AWS CLI, I managed to do this locally with the following command:

aws s3 cp s3://${BUCKET_NAME}/${FILE_NAME}.tar.gz - | gunzip -c -
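(Strictly speaking, gunzip only strips the gzip layer here; since the object is a tar archive, getting at the JSON lines themselves would still need e.g. an extra | tar -xOf - on the end of the pipe.)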

However, I would like to do it natively in Python 3.8.

Merging various solutions I found online, I tried the following strategies:

1. Uncompressing an in-memory file [not working]

import boto3, gzip, json
from io import BytesIO

s3 = boto3.resource('s3')
key = 'FILE_NAME.tar.gz'
# Iterate over the raw (still compressed) object line by line
streaming_iterator = s3.Object('BUCKET_NAME', key).get()['Body'].iter_lines()
first_line = next(streaming_iterator)

# Try to gunzip the first "line" on its own
gzipline = BytesIO(first_line)
gzipline = gzip.GzipFile(fileobj=gzipline)
print(gzipline.read())

Which raises

EOFError: Compressed file ended before the end-of-stream marker was reached
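In hindsight this failure makes sense: each "line" yielded by iter_lines() is only a fragment of the gzip stream, not a complete compressed file, so gzip hits EOF before the end-of-stream marker. A minimal sketch of chunk-wise decompression with zlib (using botocore's iter_chunks(); note the output is still the raw tar container, not yet JSON lines):

import boto3, zlib

s3 = boto3.resource('s3')
body = s3.Object('BUCKET_NAME', 'FILE_NAME.tar.gz').get()['Body']

# wbits=MAX_WBITS | 16 tells zlib to expect a gzip header
decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
for chunk in body.iter_chunks(chunk_size=1024 * 1024):
    data = decompressor.decompress(chunk)
    # 'data' is decompressed tar bytes; the tar layer still has
    # to be unpacked before the JSON lines are usable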

2. Using the external library smart_open [partially working]

import boto3
from smart_open import open

for line in open(
    f's3://{BUCKET_NAME}/{FILE_NAME}.tar.gz',
    mode="rb",
    transport_params={"client": boto3.client('s3')},
    encoding="raw_unicode_escape",
    compression=".gz"
):
    print(line)

This second solution works reasonably well for ASCII characters, but for some reason it turns non-ASCII characters into garbage; e.g.,

  • input: \xe5\x9b\xbe\xe6\xa0\x87\xe3\x80\x82
  • output: å\x9b¾æ\xa0\x87ã\x80\x82
  • expected output: 图标。

This leads me to think that the encoding I chose is wrong, but I have literally tried every codec in Python's list of standard encodings, and the only ones that don't raise an exception are raw_unicode_escape, unicode_escape and palmos (?), yet they all produce garbage.
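For what it's worth, the garbage looks exactly like UTF-8 bytes decoded one byte at a time (raw_unicode_escape behaves like Latin-1 for these byte values), which hints that the codec is not the real problem:

raw = b'\xe5\x9b\xbe\xe6\xa0\x87\xe3\x80\x82'
raw.decode('utf-8')               # '图标。' -- the expected output
raw.decode('raw_unicode_escape')  # 'å\x9b¾æ\xa0\x87ã\x80\x82' -- the observed garbage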

Any suggestion is welcome; thanks in advance.

CodePudding user response:

The return from a call to get_object() is a StreamingBody object which, as the name implies, will allow you to read from the object in a streaming fashion. However, boto3 does not support seeking on this file object.

While you can pass this object to a tarfile.open call, there are two caveats. First, you need to tell tarfile that you're passing it a non-seekable streaming object by using the | character in the mode string. Second, you can't do anything that would trigger a seek, such as attempting to get a list of files first and then operating on those files.

Putting it all together is fairly straightforward: open the object with boto3, then process each file in the tar archive in turn:

import boto3
import json
import tarfile

# Use boto3 to read the object from S3
s3 = boto3.client('s3')
resp = s3.get_object(Bucket='example-bucket', Key='path/to/example.tar.gz')
obj = resp['Body']

# Open the tar file; the "|" is important, as it instructs
# tarfile that the fileobj is non-seekable
with tarfile.open(fileobj=obj, mode='r|gz') as tar:
    # Enumerate the tar members as the data streams past
    for member in tar:
        # Skip directories and other non-file members
        if not member.isfile():
            continue
        with tar.extractfile(member) as f:
            # Read each row in turn and decode it
            for row in f:
                row = json.loads(row)
                # Just print out the filename and results in this demo
                print(member.name, row)
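If the rows are consumed elsewhere, the same logic wraps naturally into a generator; the function name and signature below are illustrative, not part of boto3:

import boto3, json, tarfile

def stream_jsonl_from_s3_tar(bucket, key):
    # Yield parsed JSON objects from every file inside a .tar.gz on S3
    body = boto3.client('s3').get_object(Bucket=bucket, Key=key)['Body']
    with tarfile.open(fileobj=body, mode='r|gz') as tar:
        for member in tar:
            if not member.isfile():
                continue
            f = tar.extractfile(member)
            if f is None:
                continue
            for row in f:
                yield json.loads(row)

Because mode='r|gz' walks the archive strictly sequentially, memory use stays bounded even though the uncompressed payload is ~30 GB.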