I want to find the items that are repeated a specified number of times in an RDD, using Spark (Scala).
I have an RDD like this:
rdd = [ text1,text2,text3,text4,text2,text4,text1,text1 ]
If time = 2, the output should be [text2,text4].
CodePudding user response:
Say you have an RDD that has been created like this:
import org.apache.spark.rdd.RDD

val df: RDD[String] = spark.sparkContext.parallelize(Seq(
  "text1", "text2", "text3", "text1", "text2", "text4"
))
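You can use countByValue followed by a filter and keys, where 2 is your time value. countByValue is an action that returns the counts to the driver as a local Map, so the result is a local collection rather than an RDD: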
df.countByValue().filter(tuple => tuple._2 == 2).keys
If we print the result, we get the following elements (their order is not guaranteed):
[text1, text2]
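If the number of distinct values is too large to bring back to the driver, a possible alternative is to count on the executors with reduceByKey and keep the result as an RDD. This is only a minimal sketch, not something from the question: the object name RepeatedItems, the helper itemsRepeated, and the local[*] session are hypothetical names chosen for illustration, and the input is the RDD from the question.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RepeatedItems {
  // Hypothetical helper: keeps the values that occur exactly `times` times.
  def itemsRepeated(rdd: RDD[String], times: Long): RDD[String] =
    rdd
      .map(item => (item, 1L))                      // pair each item with a count of 1
      .reduceByKey(_ + _)                           // sum the counts per distinct item
      .filter { case (_, count) => count == times } // keep items seen exactly `times` times
      .keys                                         // drop the counts, keep the items

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("repeated-items")
      .getOrCreate()

    // The data from the question: text1 x3, text2 x2, text3 x1, text4 x2.
    val rdd = spark.sparkContext.parallelize(Seq(
      "text1", "text2", "text3", "text4", "text2", "text4", "text1", "text1"
    ))

    // Sort after collecting so the printed order is deterministic.
    val result = itemsRepeated(rdd, 2).collect().sorted
    println(result.mkString("[", ", ", "]")) // prints [text2, text4]

    spark.stop()
  }
}
```

The result stays distributed until collect() is called, so you can keep transforming it if the list of repeated items is itself large.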
Hope this is what you want, good luck!