Is it possible to download the first 100 lines of a big file with millions of lines from S3?

I have multiple 100MB raw files with series of user activities in CSV format. I only want to download the first 100 lines of the files.

The problem is that each file may have different CSV header columns, and data values, because they are user activities from multiple subdomains using different activity tracking providers. This means that each line can be 50 characters long or 500 characters long and it is unknown until I read them all.

S3's GetObject API supports a Range parameter, which you can use to download a specific byte range of the file.

https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html#API_GetObject_RequestSyntax

If I use this API to fetch the first 1 MB of a file and iterate over the bytes until I have seen 100 newline characters (\n), would that technically work? Is there anything I have to be careful about with this approach (e.g. multibyte chars)?

CodePudding user response：

You can use smart_open like this:

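import csv

from smart_open import open

with open('s3://bucket/path/file.csv', 'r') as f:
    csv_reader = csv.DictReader(f, delimiter=',')
    rows = []
    for i, row in enumerate(csv_reader):
        if i >= 100:
            break  # stop reading once 100 data rows are collected
        rows.append(row)

store(rows)  # store() is a placeholder: write the rows wherever you need them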

You will need to open another file on your local machine with write permission to store the 100 lines, or as many as you like. If you want the first lines from multiple files, you can do the same, using the boto3 function for listing the files and passing each path/file name to a function that uses smart_open.

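import boto3

s3client = boto3.client('s3')

# bucket and prefix are placeholders for your own values
listObj = s3client.list_objects_v2(Bucket=bucket, Prefix=prefix)
# 'Contents' is absent from the response when nothing matches the prefix
for obj in listObj.get('Contents', []):
    smart_function(obj['Key'])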

obj['Key'] holds the full key (path and file name) of each object under that prefix, so smart_function needs to build the s3://bucket/key URL from it before calling smart_open.
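One caveat with the listing call: list_objects_v2 returns at most 1,000 keys per call, so a prefix holding more objects needs pagination. Below is a minimal sketch using boto3's paginator. It assumes the client is passed in, and iter_keys is a hypothetical helper name, not part of boto3.

```python
from typing import Iterator


def iter_keys(s3client, bucket: str, prefix: str) -> Iterator[str]:
    """Yield every key under the prefix, following pagination.

    s3client is expected to be a boto3 S3 client, e.g. boto3.client('s3').
    iter_keys is a hypothetical helper name used only for this sketch.
    """
    paginator = s3client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no 'Contents' entry when nothing matches the prefix
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

Each yielded key can then be passed to smart_function in place of the plain loop over listObj['Contents'].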

CodePudding user response：

There is no built-in way; byte-range fetches are the best way forward.

Since you don't know the header or line length in advance, downloading 1 MB chunks until you have 100 lines is a safe and efficient approach.
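A rough Python sketch of that loop, not a definitive implementation. read_first_lines and s3_range_fetcher are hypothetical helper names. The fetch callable is injected so the loop can be tried without S3, and the boto3 wiring assumes S3 reports the InvalidRange error code when the requested start is past the end of the object.

```python
from typing import Callable


def read_first_lines(fetch_range: Callable[[int, int], bytes],
                     max_lines: int = 100,
                     chunk_size: int = 1024 * 1024) -> bytes:
    """Fetch chunks until max_lines newline bytes were seen or the data ends.

    fetch_range(start, end) is a hypothetical callable returning the bytes in
    the inclusive range [start, end], or fewer bytes at the end of the object.
    """
    buffer = bytearray()
    newlines = 0
    offset = 0
    while True:
        chunk = fetch_range(offset, offset + chunk_size - 1)
        buffer += chunk
        offset += len(chunk)
        newlines += chunk.count(b"\n")
        # Stop once enough lines are buffered or a short chunk signals the end
        if newlines >= max_lines or len(chunk) < chunk_size:
            break

    # Cut the buffer right after the max_lines-th newline
    pos = -1
    for _ in range(max_lines):
        nxt = buffer.find(b"\n", pos + 1)
        if nxt == -1:
            return bytes(buffer)  # the object has fewer than max_lines lines
        pos = nxt
    return bytes(buffer[:pos + 1])


def s3_range_fetcher(bucket: str, key: str) -> Callable[[int, int], bytes]:
    """Build a fetch_range callable backed by boto3 (imported lazily)."""
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def fetch_range(start: int, end: int) -> bytes:
        try:
            resp = s3.get_object(Bucket=bucket, Key=key,
                                 Range=f"bytes={start}-{end}")
        except ClientError as err:
            # Assumption: a start offset past the end yields InvalidRange
            if err.response["Error"]["Code"] == "InvalidRange":
                return b""
            raise
        return resp["Body"].read()

    return fetch_range
```

With real credentials, usage would look like read_first_lines(s3_range_fetcher('bucket', 'path/file.csv')). For files with short lines, a smaller chunk_size avoids pulling a full 1 MB when a few kilobytes would do.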

Multibyte characters won't matter at this level, since you are only counting bytes until you have seen 100 \n characters (in UTF-8, the \n byte never occurs inside a multibyte sequence). Depending on the source of your files, however, I would also watch for \r\n and \r as valid line endings.
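To illustrate the line-ending point, here is a small Python sketch that splits an already downloaded byte buffer into text lines. decode_first_lines is a hypothetical helper name, and UTF-8 is an assumed default encoding. It relies on bytes.splitlines, which breaks on \n, \r\n and \r. It drops a trailing fragment that may be a partial line, or may end mid-character, when the buffer was cut at a chunk boundary. It does not handle newlines embedded inside quoted CSV fields.

```python
from typing import List


def decode_first_lines(raw: bytes,
                       max_lines: int = 100,
                       truncated: bool = True,
                       encoding: str = "utf-8") -> List[str]:
    """Split raw bytes into at most max_lines decoded lines.

    truncated=True means raw may stop mid-line (e.g. at a chunk boundary),
    so a final fragment without a line terminator is discarded.
    """
    # bytes.splitlines treats \n, \r\n and \r as line boundaries
    lines = raw.splitlines(keepends=True)
    if truncated and lines and not lines[-1].endswith((b"\n", b"\r")):
        lines.pop()  # possibly incomplete line or split multibyte character
    # Decode only complete lines, after removing their terminators
    return [line.rstrip(b"\r\n").decode(encoding) for line in lines[:max_lines]]
```

Decoding after splitting means a multibyte character cut off at the end of the buffer never reaches the decoder.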

I've written the Java code below for getting the last n bytes; feel free to use it as a starting point for getting the first n bytes:

public String getLastBytesOfObjectAsString(String bucket, String key, long lastBytesCount) {
    try {
        // Look up the object size so the start of the range can be computed
        final ObjectMetadata objectMetadata = client.getObjectMetadata(bucket, key);
        final long fileSizeInBytes = objectMetadata.getContentLength();

        // Clamp to 0 when the object is smaller than the requested byte count
        final long rangeStart = Math.max(0, fileSizeInBytes - lastBytesCount);

        // A range with only a start offset reads through to the end of the object
        final GetObjectRequest getObjectRequest =
                new GetObjectRequest(bucket, key).withRange(rangeStart);

        try (S3Object s3Object = client.getObject(getObjectRequest);
             InputStream inputStream = s3Object.getObjectContent()) {
            return new String(inputStream.readAllBytes());
        }
    } catch (Exception ex) {
        ...
    }
}
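For the first n bytes, no metadata lookup is needed, because the range simply starts at offset 0. As a hedged Python sketch of the Range header values involved (first_bytes_range and last_bytes_range are hypothetical helper names): HTTP byte ranges are inclusive, so the first n bytes are bytes=0-(n-1), and the suffix form bytes=-n asks for the last n bytes.

```python
def first_bytes_range(count: int) -> str:
    """Range header value for the first `count` bytes (inclusive offsets)."""
    if count <= 0:
        raise ValueError("count must be positive")
    return f"bytes=0-{count - 1}"


def last_bytes_range(count: int) -> str:
    """Range header value for the last `count` bytes (HTTP suffix range)."""
    if count <= 0:
        raise ValueError("count must be positive")
    return f"bytes=-{count}"
```

With boto3, either string would be passed as the Range argument, for example s3client.get_object(Bucket=bucket, Key=key, Range=first_bytes_range(1024 * 1024)).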