If I receive a few KB of data every few seconds, is using Kinesis Firehose with a Lambda function for transformation and Redshift as the target necessarily better than doing the same thing with S3 in place of Kinesis? I know Kinesis is intended for real-time processing, but is there an actual benefit over dropping files into S3 and having each new file trigger a Lambda function that processes it and stores the result in Redshift? The two approaches seem equivalent, other than that Kinesis is associated with real-time processing and S3 is not.
CodePudding user response:
Amazon Kinesis Data Firehose can combine streams of data into fewer, larger objects in Amazon S3 based on size or time. This makes it easier to store in S3 and load into Redshift.
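To get a feel for what combining by time means for the object count, here is a rough back-of-the-envelope sketch. The 5-second arrival rate and the 300-second flush interval are assumptions picked for illustration, not values from the question or Firehose defaults.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def objects_per_day(seconds_between_objects: float) -> int:
    """Number of S3 objects written per day at a fixed write interval."""
    return int(SECONDS_PER_DAY // seconds_between_objects)


# Assumed: one small payload every 5 seconds, each written to S3 as its own object.
unbuffered = objects_per_day(5)

# Assumed: the same payloads buffered and flushed to S3 once every 300 seconds.
buffered = objects_per_day(300)

print(unbuffered, buffered)  # 17280 288
```

Under those assumptions, buffering cuts the number of objects (and therefore the number of loads into Redshift) by a factor of 60.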
Amazon Redshift performs poorly if you are continually using INSERT on a few rows of data, compared to using COPY on a larger set of data (which allows parallel loading too).
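As an illustration of the COPY approach, the sketch below only builds the SQL text for loading every object under an S3 prefix in a single statement. The table name, bucket, prefix, and IAM role ARN are made-up placeholders, and the JSON format option is an assumption about how the data is stored.

```python
def build_copy_sql(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement that loads all objects under an S3 prefix.

    The arguments are interpolated directly, so they must come from trusted
    configuration, never from user input.
    """
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS JSON 'auto';"  # assumption: objects contain JSON records
    )


sql = build_copy_sql(
    table="events",  # hypothetical table
    s3_prefix="s3://example-bucket/events/2024/01/01/",  # hypothetical prefix
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",  # hypothetical role
)
print(sql)
```

One COPY over a prefix lets Redshift split the work across the files it finds there, which is where the parallel loading comes from.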
Amazon Kinesis Firehose will take care of the full process of receiving data through to inserting it into Redshift. If you want to do it yourself, you'll presumably trigger an AWS Lambda function for each object and you'll need to write code to insert it into Redshift and handle errors. It's really a matter of balancing cost against convenience.
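For a sense of what the do-it-yourself route involves, here is a minimal sketch of a Lambda handler triggered by S3 object-created events. The `load_into_redshift` function is a hypothetical placeholder for whatever actually issues the load, and the error handling shown is just one possible policy.

```python
from urllib.parse import unquote_plus


def load_into_redshift(bucket: str, key: str) -> None:
    """Hypothetical placeholder: issue the Redshift load for s3://bucket/key here."""
    print(f"would load s3://{bucket}/{key}")


def handler(event, context, load=load_into_redshift):
    """Handle an S3 event notification: try to load each object, then report failures."""
    records = event.get("Records", [])
    failed = []
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        try:
            load(bucket, key)
        except Exception as exc:
            # Keep going so one bad object does not block the others.
            failed.append({"bucket": bucket, "key": key, "error": str(exc)})
    if failed:
        # Raising marks the invocation as failed so it can be retried or alerted on.
        raise RuntimeError(f"{len(failed)} object(s) failed to load: {failed}")
    return {"loaded": len(records)}
```

Even this small version has to decide what to do about partial failures and retries, which is the work Firehose would otherwise handle for you.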
CodePudding user response:
The biggest benefit of Kinesis Firehose is the Firehose functionality itself. It bundles the incoming Kinesis events into S3 files of reasonable size. Loading many small files into Redshift can be very inefficient, so your Lambda process would also need to bundle the data into Redshift-loadable files of significant size.
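If you did the bundling yourself, the core of it might look roughly like the sketch below: buffer small records and emit one newline-delimited payload once a size or age threshold is reached. The `RecordBatcher` class and its thresholds are hypothetical, and where such a buffer would live (the producer, or a scheduled job that merges small objects) is left open.

```python
import time
from typing import List, Optional


class RecordBatcher:
    """Buffer small records and emit them as one newline-delimited payload."""

    def __init__(self, max_bytes: int = 1_000_000, max_age_seconds: float = 300.0):
        self.max_bytes = max_bytes              # assumed size threshold
        self.max_age_seconds = max_age_seconds  # assumed age threshold
        self._lines: List[bytes] = []
        self._size = 0
        self._started: Optional[float] = None

    def add(self, line: bytes, now: Optional[float] = None) -> Optional[bytes]:
        """Buffer one record; return the batch if a threshold is hit, else None."""
        now = time.monotonic() if now is None else now
        if self._started is None:
            self._started = now  # age is measured from the first buffered record
        self._lines.append(line)
        self._size += len(line) + 1  # +1 for the newline separator
        too_big = self._size >= self.max_bytes
        too_old = now - self._started >= self.max_age_seconds
        return self.flush() if (too_big or too_old) else None

    def flush(self) -> bytes:
        """Return everything buffered so far as one payload and reset the buffer."""
        batch = (b"\n".join(self._lines) + b"\n") if self._lines else b""
        self._lines, self._size, self._started = [], 0, None
        return batch


batcher = RecordBatcher(max_bytes=20, max_age_seconds=60)
first = batcher.add(b'{"id": 1}', now=0.0)   # below both thresholds, stays buffered
second = batcher.add(b'{"id": 2}', now=1.0)  # reaches the 20-byte threshold
print(first, second)
```

Note that the age check here only runs when a new record arrives, so a real version would also need a timer or a final flush to avoid stranding the last few records.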