Home > Mobile >  Spark Structured Streaming Deduplication with Watermark
Spark Structured Streaming Deduplication with Watermark

Time:09-27

I would like to use Spark Structured Streaming for an ETL job where each event is of form:

{
   "signature": "uuid",
   "timestamp: "2020-01-01 00:00:00",
   "payload": {...}
}

The events can arrive late up to 30 days and can include duplicates. I would like to deduplicate them based on the "signature" field.

If I use the recommended solution:

streamingDf \
  .withWatermark("timestamp", "30 days") \
  .dropDuplicates("signature", "timestamp")
  .write

would that track (keep in memory, store etc) a buffer of the full event content (which can be quite large) or will it just track the "signature" field values ?

Also, would the simple query like the above write new events immediately as new data arrives or would it "block" for 30 days?

CodePudding user response:

"would that track (keep in memory, store etc) a buffer of the full event content (which can be quite large) or will it just track the "signature" field values ?"

Yes, it will keep all columns of streamingDf and not only the signature and timestamp columns.

"Also, would the simple query like the above write new events immediately as new data arrives or would it "block" for 30 days?"

This query will not write events immediately as new data arrives as you are using a watermark, which means - as you expected - it will wait for 30 days before being able to decide if there is a duplicate flowing through the stream.


From my personal experience with streaming applications, I really do not recommend your approach for de-duplicating messages. Keeping the state for 30 days is quite challenging from an operational point of view. Remember, that any small network glitch, power outage, planned/unplanned maintenance of your OS etc. could cause your application to fail or produce wrong results.

I highly recommend to de-duplicate your data through another approach, like e.g. writing the data into a Delta Table or any other format or database.

  • Related