Can we use KStream with Spark in 2021?
Is it a recommended approach, or is Spark Streaming a better solution?
CodePudding user response:
Can we use KStream with Spark in 2021?
Sure.
Is it a recommended approach
Not really (if at all).
or is Spark Streaming a better solution?
"Better" begs for another question "In what scenarios?"
Kafka Streams is a library, and as such it can be used anywhere a library can be used in an application, including Spark applications. In that sense it is possible, but not really of much help IMHO.
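As a minimal sketch of the "it's just a library" point, a Kafka Streams topology can be started from any JVM entry point; here in Scala against the Java API, with the topic names and broker address as placeholders:

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.ValueMapper
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

object EmbeddedTopology {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "embedded-demo")     // hypothetical app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

    // Uppercase every record value; any JVM process can host these threads.
    val upper: ValueMapper[String, String] = v => v.toUpperCase
    val builder = new StreamsBuilder()
    builder.stream[String, String]("input-topic") // assumed topic
      .mapValues(upper)
      .to("output-topic")                         // assumed topic

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }
}
```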
Kafka Streams is simply a bunch of threads that use the Consumer and Producer APIs to transform records. That is exactly the record format Spark Structured Streaming sees when it reads from Kafka.
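For instance, a sketch of reading the same kind of topic with Spark Structured Streaming (broker and topic name are assumptions) shows the Consumer-API record layout directly: binary key and value columns, plus topic, partition, offset, and timestamp:

```scala
import org.apache.spark.sql.SparkSession

object KafkaSourceShape {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-source-shape").getOrCreate()

    // The Kafka source exposes raw records: key/value as binary, plus
    // topic, partition, offset, and timestamp columns.
    val records = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "input-topic")                  // assumed topic
      .load()

    records
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
```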
Kafka Streams applications are deployed as standalone Java applications (e.g. Docker containers in k8s). The same is also possible with Spark Structured Streaming (with Spark on Kubernetes).
I see no real benefit in using both in a single application, but I would love to be proven wrong.
CodePudding user response:
You cannot use Kafka Streams "with" Spark Streaming.
For example, Kafka Streams could consume individual records from a topic, which you would then map to single-element Spark RDDs with a parallelize call (sketched below), but at that point you're not using the Spark Streaming libraries at all.
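Here is a hedged sketch of that dead end (topic name and broker are placeholders). The foreach callback runs in the Kafka Streams threads inside the driver JVM, and each record spawns its own one-element RDD and Spark job, so none of Spark Streaming's micro-batching or checkpointing is involved:

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.ForeachAction
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.spark.sql.SparkSession

object StreamsIntoRdds {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streams-into-rdds").getOrCreate()
    val sc = spark.sparkContext

    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-into-rdds")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

    // One record -> one single-element RDD -> one Spark job: no
    // micro-batching, no checkpointing, Spark Streaming fully bypassed.
    val toRdd: ForeachAction[String, String] =
      (_, value) => sc.parallelize(Seq(value)).foreach(println)

    val builder = new StreamsBuilder()
    builder.stream[String, String]("input-topic").foreach(toRdd) // assumed topic

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook { streams.close(); spark.stop() }
  }
}
```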
Going the other way, you could consume RDDs from Spark Streaming, but then there is no way to feed that data into a Kafka Streams topology...
So, the two are not compatible in the sense of being used "together".
You can, alternatively, deploy a Spark application that "includes" Kafka Streams topologies, for example if you wanted a KTable available as part of each RDD action, but that is no different from embedding Kafka Streams in any other JVM application (see the sketch below). Just keep in mind that non-streaming Spark executors are short-lived and ephemeral, and any Kafka Streams state would not be stored with Spark checkpoints.
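A rough sketch of that embedding, under heavy assumptions: the topic, store name, and keys are hypothetical, and real code would wait for the Streams instance to reach RUNNING instead of sleeping. Note that the KTable's state lives in Kafka Streams' own RocksDB store and changelog topic, not in Spark checkpoints:

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams.kstream.Materialized
import org.apache.kafka.streams.state.{KeyValueStore, QueryableStoreTypes}
import org.apache.kafka.streams.{KafkaStreams, StoreQueryParameters, StreamsBuilder, StreamsConfig}
import org.apache.spark.sql.SparkSession

object KTableLookup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ktable-lookup").getOrCreate()
    val sc = spark.sparkContext

    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ktable-lookup")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

    // Materialize a (hypothetical) compacted topic as a queryable local store.
    val builder = new StreamsBuilder()
    builder.table[String, String](
      "prices",
      Materialized.as[String, String, KeyValueStore[Bytes, Array[Byte]]]("prices-store"))

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    Thread.sleep(10000) // crude stand-in for waiting on state RUNNING

    val store = streams.store(StoreQueryParameters.fromNameAndType(
      "prices-store", QueryableStoreTypes.keyValueStore[String, String]()))

    // The store handle is not serializable, so lookups happen on the
    // driver after collecting, not inside the RDD action itself.
    val keys = sc.parallelize(Seq("item-1", "item-2")) // hypothetical keys
    keys.collect().foreach(k => println(k -> store.get(k)))

    streams.close(); spark.stop()
  }
}
```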
Besides that, if you use Kubernetes (for example, as a Spark scheduler), then you can deploy separate containers; a Kafka Streams app could produce data for, or consume data from, a Spark Streaming job through shared topics, or vice versa.
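The glue between such containers is just Kafka topics. As one hedged example (topic names, broker, and checkpoint path are placeholders), the Spark side could write its output to a topic that an independent Kafka Streams deployment then consumes:

```scala
import org.apache.spark.sql.SparkSession

object SparkToStreamsBridge {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-to-streams").getOrCreate()

    val in = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "raw-events")                   // hypothetical input topic
      .load()

    // Pass key/value through unchanged; a real job would transform here.
    in.selectExpr("key", "value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "spark-output")                  // consumed by a separate Kafka Streams container
      .option("checkpointLocation", "/tmp/bridge-ckpt") // required by the Kafka sink
      .start()
      .awaitTermination()
  }
}
```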