AWS Redshift and small datasets


I have an S3 bucket to which many small files are uploaded (two ~1 kB files per minute).

  1. Is it good practice to ingest them into Redshift immediately, using a Lambda triggered on each upload?

  2. Or would it be better to push them to a staging area such as Postgres and then, at the end of the day, run a batch ETL from the staging area into Redshift?

  3. Or should I build a manifest file containing all of the file names for the day and use the COPY command to ingest them into Redshift?

CodePudding user response:

As Mitch says, #3. Redshift is built to work on large data sets, and if you ingest small batches many times you will need to vacuum the table frequently. Loading many files in a single COPY avoids this.
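
A minimal sketch of option #3, assuming a scheduled Lambda (or cron job) using boto3; the bucket, prefix, manifest key, table name, IAM role ARN, and cluster details are all placeholders, and the files are assumed to be CSV:

    import json
    import boto3

    s3 = boto3.client("s3")
    rsd = boto3.client("redshift-data")

    BUCKET = "my-bucket"                    # placeholder bucket
    DAY_PREFIX = "incoming/2024-01-20/"     # placeholder prefix for one day's files

    def load_day():
        # Collect every object key uploaded under the day's prefix
        entries = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=DAY_PREFIX):
            for obj in page.get("Contents", []):
                entries.append({"url": f"s3://{BUCKET}/{obj['Key']}", "mandatory": True})

        # Write a Redshift COPY manifest listing all of the day's files
        manifest_key = "manifests/2024-01-20.manifest"
        s3.put_object(
            Bucket=BUCKET,
            Key=manifest_key,
            Body=json.dumps({"entries": entries}).encode("utf-8"),
        )

        # One COPY ingests every file in the manifest in a single pass
        copy_sql = (
            "COPY my_schema.my_table "
            f"FROM 's3://{BUCKET}/{manifest_key}' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role' "
            "MANIFEST CSV;"
        )
        rsd.execute_statement(
            ClusterIdentifier="my-cluster",
            Database="dev",
            DbUser="awsuser",
            Sql=copy_sql,
        )

Running this once per day gives you a single COPY over all of that day's files instead of thousands of tiny loads.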

However, there is another potential problem: your files are too small for efficient bulk retrieval from S3. S3 is an object store, and each request has to resolve a bucket/object-key pair to a location in S3, which takes on the order of 0.5 seconds. That's not an issue when loading a few files at a time, but if you ever need to load a million of them in series, that's 500,000 seconds of lookup time. Redshift will run the COPY in parallel, but only up to the number of slices in your cluster, so it is still going to take a long time.
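
For scale: 1,000,000 × 0.5 s = 500,000 s, roughly 139 hours (about 5.8 days) of cumulative lookup time; even if, say, 8 slices performed those lookups in parallel, it would still be on the order of 17 hours.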

So, depending on your needs, you may want to rethink how you use S3. If so, you may end up with a Lambda that combines small files into bigger ones as part of your solution. You can run this in parallel with the Redshift COPY if you only need to load many, many files at once during some recovery process, but an archive of 1 billion 1 kB files will be near useless if it ever has to be loaded quickly.
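
A rough sketch of that combining step, again assuming boto3 and hypothetical bucket, prefix, and key names; it simply concatenates the small objects into one larger object, which works for line-oriented formats such as CSV or JSON lines:

    import boto3

    s3 = boto3.client("s3")

    BUCKET = "my-bucket"                      # placeholder bucket
    SMALL_PREFIX = "incoming/2024-01-20/"     # placeholder prefix holding the small files
    COMBINED_KEY = "combined/2024-01-20.csv"  # placeholder key for the merged object

    def handler(event, context):
        # Read every small object under the prefix and collect its bytes
        parts = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=SMALL_PREFIX):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
                parts.append(body)

        # Upload one larger object that a later COPY can read efficiently
        s3.put_object(Bucket=BUCKET, Key=COMBINED_KEY, Body=b"".join(parts))
        return {"combined": len(parts)}

This keeps the archive in a shape that can still be bulk-loaded quickly if you ever need to reload it.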
