Simplest Way to Stream Data to Redshift

We are migrating and refactoring some services to AWS and need to verify that the accounting information matches the legacy system before we switch over. We do call tracking/analytics and effectively have a stream of call metadata coming into an SNS/SQS processing system. We want to temporarily write the output someplace to compare against the legacy system and identify any discrepancies. This is throwaway work, so we want to keep it simple and cheap.

Kinesis to Redshift, using the new Streaming Ingestion feature, is one way to do this. The engineer in me likes that way. However, the simplest thing I can think of is to create an S3 bucket, write one file per event (call), and ingest into Redshift from there. Neither option will be expensive at our usage levels, but S3 is very, very inexpensive.

My question is whether there is any reason not to have one record per S3 file (batching seems to be the typical use case) for ingest into Redshift. We'll generate about 100K records/day, so we're far from big by today's standards.

Cheers! Rich

CodePudding user response:

Yes, there is a reason not to do this: don't do it if you have a large amount of data, or, put differently, a very large number of files, and you don't want to wait hours or days for the data to load.

This is due to the nature of S3, which is an object store. The files (objects) written to or read from S3 live on a large cluster of computers with storage, and an object's location within this cluster is based on a hash of the bucket and key. This lookup takes on the order of 0.5 sec to perform (from requesting the data to the first bytes arriving). Once the data starts to flow, it moves at a reasonably high rate.

When you go to load these many, many very small files, the majority of the time will be spent addressing each object, with only a very small amount of time spent moving the data. While Redshift can read objects from S3 in parallel, this parallelism is limited by the number of slices in the cluster. Moving a few hundred bytes in 0.5 sec is a data rate that harkens back to the 1980s.
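
To put rough numbers on that, here is a back-of-envelope sketch using the ~0.5 sec per-object overhead above and the question's 100K records/day; the slice count is an assumption for illustration, not something from the question:

```python
# Rough back-of-envelope: overhead of loading 100K one-record files.
# The 0.5 s per-object figure comes from the discussion above; the slice count is assumed.
files_per_day = 100_000
per_object_overhead_s = 0.5      # request-to-first-byte latency per S3 object
slices = 8                       # assumed number of Redshift slices reading in parallel

serial_hours = files_per_day * per_object_overhead_s / 3600
parallel_hours = serial_hours / slices

print(f"~{serial_hours:.0f} hours of pure object-addressing overhead, "
      f"~{parallel_hours:.1f} hours even spread across {slices} slices")
# ~14 hours serially, ~1.7 hours across 8 slices -- before any data actually moves.
```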

To get the full available bandwidth from S3, the files stored should be at least 100 MB and ideally 1 GB in size. Then the time to address the object is insignificant compared to the data transfer time. This holds when reading data no matter what tool is doing the reading. Writing files isn't an issue, since the data is cached before its final location in the S3 cluster is determined.

You will want to combine your records into larger objects in S3.
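
A minimal sketch of that batching, assuming boto3 and newline-delimited JSON; the queue URL, bucket name, batch size, and key layout here are illustrative placeholders, not anything from the question:

```python
import json
import time
import uuid
import boto3

# Illustrative names -- replace with your own queue and bucket.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/call-events"
BUCKET = "call-accounting-staging"
BATCH_SIZE = 5_000          # records per S3 object
FLUSH_SECONDS = 300         # or flush on a timer, whichever comes first

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

def flush(records):
    """Write one newline-delimited JSON object per batch instead of one object per record."""
    if not records:
        return
    key = f"calls/{time.strftime('%Y/%m/%d')}/{uuid.uuid4()}.json"
    body = "\n".join(json.dumps(r) for r in records)
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))

buffer, last_flush = [], time.time()
while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        buffer.append(json.loads(msg["Body"]))
        # A production consumer would delete only after a successful flush.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    if len(buffer) >= BATCH_SIZE or time.time() - last_flush > FLUSH_SECONDS:
        flush(buffer)
        buffer, last_flush = [], time.time()
```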

By the way, a single object isn't ideal either, as that won't allow Redshift to read the data in parallel. Parallel reading can speed up the load by a factor equal to the number of slices in the Redshift cluster (2 to a few hundred), which is a meaningful improvement. You can, on the other hand, have a nearly unlimited number of objects in S3, which can tank just about all load performance.
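
For completeness, a sketch of the load side, assuming the batched objects written above; the cluster endpoint, table name, and IAM role ARN are placeholders. COPY from a key prefix picks up every object under it, so Redshift can assign the files to its slices and read them in parallel, provided there is more than one file and each is reasonably large:

```python
import psycopg2

# Placeholders -- substitute your own table, prefix, and role ARN.
copy_sql = """
    COPY call_events
    FROM 's3://call-accounting-staging/calls/2023/09/30/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS JSON 'auto';
"""

# Connect to the cluster and run the COPY; one statement loads the whole prefix.
with psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                      port=5439, dbname="dev", user="loader", password="...") as conn:
    with conn.cursor() as cur:
        cur.execute(copy_sql)
```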
