Merge S3 files into multiple <1GB S3 files


I have multiple S3 files in a bucket.

Input S3 bucket:
File 1 - 2GB data
File 2 - 500MB data
File 3 - 1GB data
File 4 - 2GB data

and so on. Assume there are 50 such files. The data in every file has the same schema, let's say attribute1, attribute2.

I want to merge these files and write the output to a new bucket as follows, such that each output file is less than 1GB and has the same schema as before.

File 1 - <1GB
File 2 - <1GB
File 3 - <1GB

I am looking for AWS-based solutions that I can deliver using AWS CDK. I was considering the following two solutions:

  1. AWS Athena - reads from and writes to S3, but I'm not sure whether I can set a 1GB limit on the output files.
  2. AWS Lambda - read the files sequentially, buffer them in memory, and whenever the buffer nears 1GB, write it out as a new file in the target S3 bucket; repeat until all files are processed. I'm worried about the 15-minute timeout and not sure Lambda will be able to finish.

Expected scale -> total input size across all files: 1 TB

What would be a good way to implement this? I hope I have phrased the question right; I'd be happy to clarify in the comments if anything is unclear.

Thanks!

Edit: Based on a comment -> Apologies for calling it a merge; it's more of a re-split. All files have the same schema and are CSV files. In pseudocode:

    List<File> listOfFiles = ReadFromS3(key)
    Create a new file named temp.csv
    for (File file : listOfFiles)
        append file's contents to temp.csv
    List<File> finalList = split temp.csv into chunks of <1GB each
    for (File file : finalList)
        writeToS3(file)

CodePudding user response:

Amazon Athena can run a query across multiple objects in a given Amazon S3 path, as long as they all have the same format (e.g. the same columns in a CSV file).

It can store the result in a new external table, with its location pointing to an S3 bucket, by using a CREATE TABLE AS (CTAS) statement with an external_location parameter.

The size of the output files can be controlled by setting the number of output buckets (which is not the same as an S3 bucket).
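As a rough sketch of that approach, run via boto3 (the database, table, column, and bucket names, as well as the bucket count of 1,100, are placeholder assumptions, not from this answer; the source table must already be defined over the input CSV files, e.g. with CREATE EXTERNAL TABLE or a Glue crawler):

    import boto3

    athena = boto3.client("athena")

    # CTAS: write the query result out as comma-delimited text, hashed into
    # 1,100 buckets. For ~1 TB of input that should keep each file under 1 GB,
    # although actual sizes depend on how evenly attribute1 hashes.
    ctas_query = """
    CREATE TABLE merged_output
    WITH (
        external_location = 's3://my-output-bucket/merged/',  -- placeholder bucket
        format = 'TEXTFILE',
        field_delimiter = ',',
        bucketed_by = ARRAY['attribute1'],
        bucket_count = 1100
    ) AS
    SELECT attribute1, attribute2
    FROM input_table;
    """

    athena.start_query_execution(
        QueryString=ctas_query,
        QueryExecutionContext={"Database": "my_database"},                 # placeholder
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"}, # placeholder
    )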


CodePudding user response:

If your process includes an ETL (Extract, Transform, Load) post-processing step, you could use AWS Glue. Please find here an example of Glue using S3 as a source.
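For reference, a minimal Glue (PySpark) sketch of the same re-split, assuming plain CSV files with a header row; the bucket paths and the partition count of 1,100 are placeholders to tune:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session

    # Read every CSV object under the input prefix into one DataFrame
    # (same header and schema assumed across all files).
    df = spark.read.option("header", "true").csv("s3://my-input-bucket/data/")

    # Repartition so each output file lands under ~1 GB.
    # With ~1 TB of input, ~1,100 partitions is a reasonable starting point.
    (df.repartition(1100)
       .write
       .option("header", "true")
       .mode("overwrite")
       .csv("s3://my-output-bucket/merged/"))

A Glue job can also be defined from AWS CDK, which fits the delivery requirement in the question.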
