Context: I'm scraping data from a third-party source using a Lambda function (the function is invoked by CloudWatch EventBridge, so it runs asynchronously), then writing that data to Kinesis Firehose, which writes it to an S3 bucket. This allows for data buffering and ensures that the data is written to S3 regardless of S3 connection failures (since Kinesis will hold on to the data and retry the writes). I'm scraping the data from the third-party source in chunks (meaning that I make multiple HTTP calls) and writing each chunk to Firehose as I go.
Question: If my Lambda fails midway while getting data from the third-party source, is there a way that I can re-invoke the Lambda and poll Kinesis to see what data exists there, to ensure that I'm not rewriting the same data to Kinesis? Essentially, I want the Lambda to pick up fetching data from the same point at which it failed.
CodePudding user response:
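This would be a good use case for Lambda Destinations. For an asynchronous invocation that fails (after Lambda's built-in retries are exhausted), you can route the result to another Lambda function, SNS, SQS, or EventBridge. Your business logic there can then determine what to do, such as repeating the request.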
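As a rough sketch of what wiring up an on-failure destination could look like with boto3: the helper name `configure_failure_destination`, the function name, and the ARN below are all hypothetical placeholders, and the client is passed in so the helper itself has no hard dependency on AWS credentials. The only AWS call used is the Lambda client's `put_function_event_invoke_config`.

```python
from typing import Any, Dict


def configure_failure_destination(
    lambda_client: Any,
    function_name: str,
    destination_arn: str,
    max_retries: int = 2,
) -> Dict[str, Any]:
    """Send failed asynchronous invocations of `function_name` to `destination_arn`.

    `lambda_client` is expected to be a boto3 Lambda client.
    `destination_arn` may point at an SQS queue, SNS topic, Lambda function,
    or EventBridge event bus.
    """
    return lambda_client.put_function_event_invoke_config(
        FunctionName=function_name,
        # Lambda retries async invocations up to this many times (0 to 2)
        # before the event is handed to the on-failure destination.
        MaximumRetryAttempts=max_retries,
        DestinationConfig={"OnFailure": {"Destination": destination_arn}},
    )


# Example usage (hypothetical names, requires boto3 and AWS credentials):
#   import boto3
#   configure_failure_destination(
#       boto3.client("lambda"),
#       "my-scraper-function",
#       "arn:aws:sqs:us-east-1:123456789012:scraper-failures",
#   )
```

The failure record delivered to the destination includes the original event payload, so whatever consumes it can decide whether to re-invoke the scraper.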
CodePudding user response:
If my Lambda fails midway while getting data from the third-party source, is there a way that I can re-invoke the Lambda and poll Kinesis to see what data exists there, to ensure that I'm not rewriting the same data to Kinesis?
No. Kinesis Firehose is not Kinesis Data Streams, and you can't read from Firehose the way you can from Data Streams. I think the easiest approach would be to set up a DynamoDB table (or any equivalent store) that holds some kind of "bookmark" recording what you have most recently processed, so that a re-invoked Lambda can resume from that point.
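A minimal sketch of the bookmark idea, under several assumptions that are not in the question: the table has a string partition key named `job_id`, the bookmark is stored as an integer attribute `next_chunk`, and chunks can be addressed by index. `fetch_chunk` and `send_to_firehose` are hypothetical callables standing in for the HTTP call and the Firehose write. The `table` argument is expected to behave like a boto3 DynamoDB `Table` resource (only `get_item` and `put_item` are used).

```python
from typing import Any, Callable


def load_bookmark(table: Any, job_id: str) -> int:
    """Return the index of the next chunk to fetch (0 if no bookmark exists yet)."""
    response = table.get_item(Key={"job_id": job_id}, ConsistentRead=True)
    item = response.get("Item")
    # DynamoDB returns numbers as Decimal, so convert explicitly.
    return int(item["next_chunk"]) if item else 0


def save_bookmark(table: Any, job_id: str, next_chunk: int) -> None:
    """Record that every chunk before `next_chunk` has been handed to Firehose."""
    table.put_item(Item={"job_id": job_id, "next_chunk": next_chunk})


def run_scrape(
    table: Any,
    job_id: str,
    total_chunks: int,
    fetch_chunk: Callable[[int], Any],
    send_to_firehose: Callable[[Any], None],
) -> int:
    """Fetch and forward chunks, resuming from the stored bookmark.

    Returns the chunk index this run started from.
    """
    start = load_bookmark(table, job_id)
    for index in range(start, total_chunks):
        send_to_firehose(fetch_chunk(index))
        # Advance the bookmark only after the chunk was written successfully.
        save_bookmark(table, job_id, index + 1)
    return start


# Example usage inside a handler (hypothetical names, requires boto3):
#   import boto3
#   table = boto3.resource("dynamodb").Table("scrape-bookmarks")
#   run_scrape(table, "daily-scrape-2024-01-01", 10, fetch_chunk, send_to_firehose)
```

Because the bookmark is saved after the Firehose write, a crash between those two steps would cause that one chunk to be sent again on the next run, so this gives at-least-once delivery rather than exactly-once. If duplicates matter, they would need to be handled downstream.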