Spark structured streaming job parameters for writing .compact files


I'm currently streaming from a file source, but every time a .compact file needs to be written, there's a big latency spike (~5 minutes; the .compact files are about 2.7GB). This is kind of aggravating because I'm trying to keep a rolling window's latency below a threshold and throwing an extra five minutes on there once every, say, half-hour messes with that.

Are there any Spark parameters for tweaking .compact file writes? This system seems very lightly documented.

CodePudding user response:

It looks like you ran into a reported bug: SPARK-30462 ("Structured Streaming _spark_metadata fills up Spark Driver memory when having lots of objects"), which was fixed in Spark 3.1.

Before that version there is no configuration that prevents the compact file from growing incrementally, and the compaction uses quite a lot of driver memory, which makes it slow.

Here is the description from the release notes for Structured Streaming:

Streamline the logic on file stream source and sink metadata log (SPARK-30462)

Before this change, whenever the metadata was needed in FileStreamSource/Sink, all entries in the metadata log were deserialized into the Spark driver’s memory. With this change, Spark will read and process the metadata log in a streamlined fashion whenever possible.
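For context, here is a minimal sketch of the kind of file-source-to-file-sink query where this applies (the paths, format, and schema are made up for illustration): both the file source and the file sink keep a metadata log that is periodically rolled up into .compact files, and before 3.1 the whole log was deserialized into the driver whenever it was needed.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of a file-source -> file-sink streaming query.
// Input/output paths and the schema are illustrative placeholders.
object FileStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("file-stream-sketch")
      .getOrCreate()

    // The file source keeps a metadata log of the files it has already seen;
    // the file sink keeps one under <output path>/_spark_metadata. Both logs
    // are periodically compacted into .compact files.
    val input = spark.readStream
      .format("json")
      .schema("id LONG, payload STRING") // streaming file sources need an explicit schema
      .load("/data/incoming")

    val query = input.writeStream
      .format("parquet")
      .option("path", "/data/output")
      .option("checkpointLocation", "/data/checkpoints/file-stream-sketch")
      .start()

    query.awaitTermination()
  }
}
```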

CodePudding user response:

No sooner do I throw in the towel than an answer appears. According to Jacek Laskowski's book on Spark Structured Streaming here: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark-sql-streaming-properties.html

There is a parameter, spark.sql.streaming.fileSource.log.compactInterval, that controls how often (in metadata log batches) the file source log is compacted; the default is 10. If anyone knows of any other parameters that control this behaviour, please let me know!
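For anyone who wants to try it, below is a sketch of how the setting could be applied when building the session. The value 100 is only an example (a larger interval means fewer, but bigger, compactions); and for queries that write to a file sink there is a sink-side counterpart, spark.sql.streaming.fileSink.log.compactInterval.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: raise the file source metadata log compaction interval so that
// compactions happen less often. Each compaction will then cover more
// batches, so this trades frequency for size. 100 is only an example value.
val spark = SparkSession.builder()
  .appName("compaction-interval-sketch")
  .config("spark.sql.streaming.fileSource.log.compactInterval", "100")
  .getOrCreate()
```

The same key can also be passed on the command line, e.g. via spark-submit with --conf spark.sql.streaming.fileSource.log.compactInterval=100.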
