Increase the output size of a Spark Structured Streaming job


Context: I have a Spark Structured Streaming job with Kafka as the source and S3 as the sink. The output files on S3 are then picked up as input by other MapReduce jobs. I therefore want to increase the size of the files written to S3 so that the MapReduce jobs run efficiently. Currently, because the input files are small, the MapReduce jobs take far too long to complete.

Is there a way to configure the streaming job to wait until at least 'X' records are available before processing?

CodePudding user response:

No, there is not.

You can look here for the next best alternatives: How to specify batch interval in Spark Structured Streaming? Options 2 and 3 there could go part of the way towards meeting your goal, but there is no guarantee.
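As an illustration of that general direction, one commonly suggested alternative is simply to use a longer processing-time trigger so each micro-batch accumulates more Kafka data before it is written to S3. Below is a minimal sketch; the broker address, topic name, bucket paths, and the 30-minute interval are all placeholder values, not anything from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object SlowTriggerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-s3-slow-trigger")
      .getOrCreate()

    // Read the Kafka topic as a streaming DataFrame.
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
      .option("subscribe", "events")                      // placeholder topic
      .load()

    // A longer trigger interval means each micro-batch covers more input,
    // which tends to produce fewer, larger output files on S3.
    val query = input
      .selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/output/")                    // placeholder sink path
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/") // placeholder checkpoint path
      .trigger(Trigger.ProcessingTime("30 minutes"))                // fire a micro-batch every 30 minutes
      .start()

    query.awaitTermination()
  }
}
```

Note that this only controls how often a batch fires, not how much data arrives in it, so it cannot guarantee a minimum output size.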

CodePudding user response:

You probably want the micro-batch trigger to wait until sufficient data is available at the source. You can use the minOffsetsPerTrigger option to wait until enough offsets have accumulated in Kafka. Make sure to also set a maxTriggerDelay that suits your application's needs, since a batch will be fired after that delay even if the minimum number of offsets has not been reached.
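A minimal sketch of that approach follows. It assumes a Spark release where the Kafka source supports minOffsetsPerTrigger (Spark 3.3+); the broker, topic, S3 paths, and the concrete thresholds are placeholder values, not anything from the question.

```scala
import org.apache.spark.sql.SparkSession

object MinOffsetsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-s3-min-offsets")
      .getOrCreate()

    // Wait until roughly 5 million new offsets are available across the
    // subscribed topic before firing a micro-batch, but never wait longer
    // than 30 minutes (maxTriggerDelay forces a batch after that).
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
      .option("subscribe", "events")                      // placeholder topic
      .option("minOffsetsPerTrigger", 5000000L)           // placeholder threshold
      .option("maxTriggerDelay", "30m")                   // placeholder upper bound on waiting
      .load()

    val query = input
      .selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/output/")                    // placeholder sink path
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/") // placeholder checkpoint path
      .start()

    query.awaitTermination()
  }
}
```

Tune the threshold to the record size in your topic so that each micro-batch roughly matches the file size your downstream MapReduce jobs handle well.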
