I'm super new to Scala and trying to write a little Spark Streaming program where lines of a text file come in, are stripped of any non-alphanumeric characters, and are then mapped, reduced and printed out.
I've written the code below and am running it using the following command:
spark-submit --class group.WordCount --master yarn --deploy-mode client bdpAssignmentFour.jar hdfs:///user/s3797303
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sconf = new SparkConf().setAppName("SparkWordCount")
    val ssc = new StreamingContext(sconf, Seconds(5))
    val pat = "^[a-zA-Z0-9]*$".r
    val lines = ssc.textFileStream(args(0))
    val lines_map = lines.map(line => pat.findAllIn(line))
    lines_map.print()
    val wordCounts = lines_map.map((_, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
This seems to begin running just fine, but then returns an error about the following line: val lines_map = lines.map(line => pat.findAllIn(line))
The error it gives reads as follows: object not serializable (class: scala.util.matching.Regex$MatchIterator, value: empty iterator)
What can I do to make the object serializable and hopefully end up with some code that runs?
CodePudding user response:
Quick fix:
- put the pat value inline: lines.map(line => "...".r.findAllIn(line))
- or move it inside an object outside of your existing method and object
You won't be able to make the regex serializable; your goal is to move it to a place where Spark won't need to serialize it.
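Here is a minimal sketch of the second option, with the pattern kept in its own object outside the job's method. The error message in the question names Regex$MatchIterator, which is what findAllIn returns, so the sketch also turns the matches into a plain List[String] before they would become stream elements. That step is my assumption about what may help, not something stated above. The names WordPatterns, extractWords and isJavaSerializable are hypothetical, and the pattern [a-zA-Z0-9]+ is a stand-in chosen for illustration, not the pattern from the question. The sketch uses no Spark at all, so it can be tried in a plain Scala REPL.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.util.Try

// Hypothetical holder object, defined outside the job's main method.
object WordPatterns {
  // Stand-in pattern (an assumption): runs of alphanumeric characters.
  val wordPattern = "[a-zA-Z0-9]+".r

  // Materialize the lazy MatchIterator into a List of plain strings.
  def extractWords(line: String): List[String] =
    wordPattern.findAllIn(line).toList

  // Helper for experimenting: attempts plain Java serialization of a value
  // and reports whether it succeeded.
  def isJavaSerializable(value: AnyRef): Boolean = Try {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try out.writeObject(value) finally out.close()
  }.isSuccess
}
```

In the streaming job this could be used as lines.flatMap(line => WordPatterns.extractWords(line)), so that each stream element is a plain String before the map((_, 1)) and reduceByKey(_ + _) steps. Calling isJavaSerializable on the result of findAllIn and on the result of extractWords is a quick way to compare the two.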