Finding the items that are repeated a specified number of times in Spark (Scala)

I want to find the items that are repeated a specified number of times in Spark (Scala).

I have an RDD like this:

rdd = [ text1,text2,text3,text4,text2,text4,text1,text1 ]

If the number of times is 2,

the output should be [text2, text4].

Answer:

Say you have an RDD that has been created like this:

import org.apache.spark.rdd.RDD

// Sample data: "text1" and "text2" each appear twice
val rdd: RDD[String] = spark.sparkContext.parallelize(Seq(
  "text1", "text2", "text3", "text1", "text2", "text4"
))

You can use countByValue, which returns a local Map from each item to its count, followed by filter and keys, where 2 is your time value:

rdd.countByValue().filter(tuple => tuple._2 == 2).keys

If we print the result, we get the following output:

[text1, text2]
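
Note that countByValue returns every distinct item and its count to the driver as a local Map. That is fine when there are few distinct values. If the RDD has many distinct items, a variant that keeps the counting distributed might look like the sketch below. The object name RepeatedItems, the helper itemsRepeated, and the local[*] master are assumptions for illustration, not part of the approach above.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RepeatedItems {
  // Hypothetical helper: keeps only the items that occur exactly `times` times.
  // The counting happens on the executors; nothing is collected here.
  def itemsRepeated(rdd: RDD[String], times: Long): RDD[String] =
    rdd
      .map(item => (item, 1L))                        // pair each item with a count of 1
      .reduceByKey(_ + _)                             // sum the counts per item
      .filter { case (_, count) => count == times }   // keep exact matches only
      .keys                                           // drop the counts

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for trying the sketch out on one machine
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("repeated-items")
      .getOrCreate()

    // The data from the question
    val rdd = spark.sparkContext.parallelize(Seq(
      "text1", "text2", "text3", "text4", "text2", "text4", "text1", "text1"
    ))

    // Prints text2 and text4, in no guaranteed order
    itemsRepeated(rdd, 2).collect().foreach(println)

    spark.stop()
  }
}
```

The result stays an RDD, so you can keep transforming it or write it out, and only call collect() when the filtered set is small enough for the driver.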

Hope this is what you want, good luck!