I'm new to AWS, and I'm working on archiving data from DynamoDB to S3. This is my solution and I have done the pipeline.
DynamoDB -> DynamoDB TTL DynamoDB Stream -> Lambda -> Kinesis Firehose -> S3
But I found that the files in S3 has different number of JSON objects. Some files has 7 JSON objects, some has 6 or 4 objects. I have done ETL in lambda, the S3 only saves REMOVE
item, and the JSON has been unmarshall
.
I thought it would be a JSON object in a file, since the TTL value is different for each item, and the lambda would deliver the item immediately when the item is deleted by TTL.
Does it because the Kinesis Firehose batches the items? (It would wait for sometime after collecting more items then saving them to a file) Or there's other reason? Could I estimate how many files it will save if DynamoDB has a new item is deleted by TTL every 5 minutes?
Thank you in advance.
CodePudding user response:
Kinesis Firehose splits your data based on buffer size or interval.
Let's say you have a buffer size of 1MB and an interval of 1 minute. If you receive less than 1MB within the 1 minute interval, Kinesis Firehose will anyway create a batch file out of the received data, even if it is less than 1MB of data.
This is likely happening in scenarios with few data arriving. You can adjust your buffer size and interval to your needs. E.g. You could increase the interval to collect more items within a single batch.
You can choose a buffer size of 1–128 MiBs and a buffer interval of 60–900 seconds. The condition that is satisfied first triggers data delivery to Amazon S3.
From the AWS Kinesis Firehose Docs: https://docs.aws.amazon.com/firehose/latest/dev/create-configure.html