What is a good way to replace Python's built-in open()
function when working with Amazon S3 buckets in an AWS Lambda function?
Summary
- I am looking for a method to download a file from or upload a file to Amazon S3 in an AWS Lambda function.
- The syntax/API should be similar to Python's built-in `open()`, specifically returning a file-like object that could be passed to other functions like `pandas.read_csv()`.
  - I am mostly interested in `read()` and `write()` and not so much `seek()` or `tell()`, which would be required for `PIL.Image.open()`, for example.
- The method should use libraries already available in AWS Lambda, e.g. boto3.
- It should keep the Lambda deployment size small, so not a large dependency like s3fs, which is usually overkill for an AWS Lambda.
Here is an example of what I am thinking of:

```python
filename = "s3://mybucket/path/to/file.txt"
outpath = "s3://mybucket/path/to/lowercase.txt"

with s3_open(filename) as fd, s3_open(outpath, "wt") as fout:
    for line in fd:
        fout.write(line.strip().lower())
```
Motivation
Most people using Python are familiar with

```python
filename = "/path/to/file.txt"

with open(filename) as fd:
    lines = fd.readlines()
```
Those using Amazon S3 are also probably familiar with S3 URIs, but S3 URIs are not convenient for working with boto3, the Amazon S3 Python SDK:

- boto3 uses parameters like `s3.get_object(Bucket=bucket, Key=key)`, whereas I usually have the S3 URI
- boto3 returns a response dict, which contains a StreamingBody, and all I want is the StreamingBody
- The StreamingBody returns bytes, but text is usually more convenient
Many Python libraries accept file-like objects, e.g. json, pandas, zipfile.
I often just need to download/upload a single file to S3 so there's no need to manage a whole file system. Nor do I need or want to save the file to disk only to read it back into memory.
A start
```python
import io

import boto3

session = boto3.Session()
s3_client = boto3.client("s3")


def s3uriparse(s3_uri):
    raise NotImplementedError


def s3_open(s3_uri, mode="rt"):
    bucket, key = s3uriparse(s3_uri)
    if mode.startswith("r"):
        r = s3_client.get_object(Bucket=bucket, Key=key)
        fileobj = r["Body"]  # the StreamingBody in the get_object response
        if mode.endswith("t"):
            fileobj = io.TextIOWrapper(fileobj._raw_stream)
        return fileobj
    elif mode.startswith("w"):
        # Write mode
        raise NotImplementedError
    else:
        raise ValueError("Invalid mode")
```
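For reference, here is one way the two placeholders could be filled in. This is only a sketch under my own assumptions (the `_S3WriteBuffer` helper is hypothetical, and text reads buffer the whole object in memory rather than touching the private `_raw_stream` attribute), not a tested Lambda deployment:

```python
import io
from urllib.parse import urlparse

import boto3

s3_client = boto3.client("s3")


def s3uriparse(s3_uri):
    # "s3://mybucket/path/to/file.txt" -> ("mybucket", "path/to/file.txt")
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3":
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    return parsed.netloc, parsed.path.lstrip("/")


class _S3WriteBuffer(io.BytesIO):
    # Hypothetical helper: in-memory buffer that uploads its contents
    # with put_object when it is closed.
    def __init__(self, bucket, key):
        super().__init__()
        self._bucket = bucket
        self._key = key

    def close(self):
        if not self.closed:
            s3_client.put_object(
                Bucket=self._bucket, Key=self._key, Body=self.getvalue()
            )
        super().close()


def s3_open(s3_uri, mode="rt"):
    bucket, key = s3uriparse(s3_uri)
    if mode.startswith("r"):
        # Read the whole object into memory; fine for the single-file Lambda use case.
        data = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        raw = io.BytesIO(data)
    elif mode.startswith("w"):
        raw = _S3WriteBuffer(bucket, key)
    else:
        raise ValueError(f"Invalid mode: {mode}")
    return io.TextIOWrapper(raw, encoding="utf-8") if mode.endswith("t") else raw
```

With that sketch, the `with s3_open(...)` example above should behave as intended: closing the text wrapper closes the underlying buffer, which triggers the upload.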
CodePudding user response:
There is a Python library called smart-open, available on PyPI.
It's really good, because you can use all the file-handling commands you're familiar with, and it works with S3 objects! It can also read from compressed files.
```python
>>> from smart_open import open
>>>
>>> # stream lines from an S3 object
>>> for line in open('s3://commoncrawl/robots.txt'):
...     print(repr(line))
...     break
'User-Agent: *\n'

>>> # stream from/to compressed files, with transparent (de)compression:
>>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'):
...     print(repr(line))
'It was a bright cold day in April, and the clocks were striking thirteen.\n'
'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'

>>> # can use context managers too:
>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin:
...     with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout:
...         for line in fin:
...             fout.write(line)
74
80
78
79

>>> # can use any IOBase operations, like seek
>>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
...     for line in fin:
...         print(repr(line.decode('utf-8')))
...         break
...     offset = fin.seek(0)  # seek to the beginning
...     print(fin.read(4))
'User-Agent: *\n'
b'User'

>>> # stream from HTTP
>>> for line in open('http://example.com/index.html'):
...     print(repr(line))
...     break
'<!doctype html>\n'
```
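Since `open()` from smart_open returns ordinary file-like objects, it should also cover the pandas use case from the question. A minimal sketch, assuming smart_open is packaged with the Lambda and using placeholder bucket/key names:

```python
import pandas as pd
from smart_open import open as s3_open  # alias so the built-in open is not shadowed

# read a CSV straight from S3 into pandas
with s3_open("s3://mybucket/path/to/data.csv") as fd:
    df = pd.read_csv(fd)

# write a lowercased copy back to S3
with s3_open("s3://mybucket/path/to/lowercase.csv", "w") as fout:
    fout.write(df.to_csv(index=False).lower())
```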
CodePudding user response:
I am confused about your motivation: what is wrong with
```python
s3 = boto3.resource('s3')
s3_object = s3.Object(bucket_name, full_key)
s3_object.put(Body=byte_stream_of_some_kind)
```
for write and
```python
s3 = boto3.client('s3')
s3_object_byte_stream = s3.get_object(Bucket=bucket_name, Key=object_key)['Body'].read()
```
for streaming the object into your Lambda to work on?
Both of these stream straight into or out of S3: you don't have to download the object to disk and then open it as a stream after the fact. If you want, you can still use a with statement with either to close things up automatically (use it on the resource).

There is also no file system in S3. Although object keys use a file-system-like nomenclature (the '/') and are displayed in the console as if they were directories, the actual internal layout of an S3 bucket is flat, with the names merely being parseable. So if you know the full key of an object, you can stream it into and out of your Lambda without any downloading at all.
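To tie that back to the question's file-like requirement, here is a minimal sketch using only the boto3 calls shown above plus the standard library (the bucket and key names are placeholders):

```python
import io

import boto3

s3 = boto3.client("s3")
bucket_name = "mybucket"          # placeholder
object_key = "path/to/file.txt"   # placeholder

# read: stream the object's bytes and wrap them in a text file object
body = s3.get_object(Bucket=bucket_name, Key=object_key)["Body"].read()
with io.TextIOWrapper(io.BytesIO(body), encoding="utf-8") as fd:
    lines = fd.readlines()

# write: build the new content in memory, then put it straight back to S3
new_body = "".join(line.lower() for line in lines).encode("utf-8")
s3.put_object(Bucket=bucket_name, Key="path/to/lowercase.txt", Body=new_body)
```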