Kinesis Glue S3 takes more than a minute to generate new data

Time:03-01

I'm trying to implement a real-time or near real-time pipeline that updates roughly every 5 seconds.

I created a producer that writes 1 record per second to a Kinesis data stream and hooked it up to a Glue job running Spark Streaming. Once the job was running, I watched the data update in S3 and Athena, and saw that it took 2-3 minutes to batch and save new data.

I increased the number of workers from 2 to 20, but this only brought it down to one update every 1-2 minutes.

Is this a limitation of Spark, and is it what people mean when they say Spark is near real-time rather than truly real-time?

I'm going to try to implement something faster with Lambda and DynamoDB, but I'd really like to know whether 5-second updates with Glue are possible.

Thanks!

CodePudding user response:

By default, AWS Glue processes and writes out data in 100-second windows. This allows data to be processed efficiently and permits aggregations to be performed on data arriving later than expected. You can modify this window size to increase timeliness or aggregation accuracy.

You could try using this function and changing its windowSize.
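
The link to the function is not preserved here. Presumably it is GlueContext.forEachBatch, which takes a windowSize option in the streaming scripts that Glue generates; treat that as an assumption. A minimal sketch of lowering the window to 5 seconds could look like the following. The bucket paths and the helper names (streaming_options, process_batch, start_stream) are hypothetical, and glue_context and source_frame are assumed to come from the usual Glue job setup.

```python
def streaming_options(window_seconds, checkpoint_location):
    """Build the options dict passed to GlueContext.forEachBatch.

    windowSize is a string such as "100 seconds" (the default window).
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return {
        "windowSize": f"{window_seconds} seconds",
        "checkpointLocation": checkpoint_location,
    }


def process_batch(data_frame, batch_id):
    # Hypothetical per-batch handler: append each non-empty micro-batch to S3.
    # The output path is a placeholder, not taken from the original post.
    if data_frame.count() > 0:
        data_frame.write.mode("append").parquet("s3://example-bucket/output/")


def start_stream(glue_context, source_frame):
    # glue_context is assumed to be an awsglue.context.GlueContext and
    # source_frame the streaming DataFrame read from the Kinesis stream.
    glue_context.forEachBatch(
        frame=source_frame,
        batch_function=process_batch,
        options=streaming_options(5, "s3://example-bucket/checkpoints/"),
    )
```

A shorter window means more frequent, smaller writes to S3, so each batch still pays its own scheduling and write overhead. Whether a 5-second window actually yields 5-second updates would need to be measured on the real job.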
