I have an RDD[(String, String, Int)] and want to use reduceByKey to get the result shown below. I don't want to convert it to a DataFrame and then perform a groupBy operation to get the result. The Int is a constant, always 1.
Is it possible to use reduceByKey here to get the result? I'm presenting it in tabular format for easy reading.
Question
String | String | Int |
---|---|---|
First | Apple | 1 |
Second | Banana | 1 |
First | Flower | 1 |
Third | Tree | 1 |
Result
String | String | Int |
---|---|---|
First | Apple,Flower | 2 |
Second | Banana | 1 |
Third | Tree | 1 |
CodePudding user response:
You cannot use reduceByKey directly here: it is only defined on pair RDDs, i.e. RDD[(K, V)], so it would work if you had an RDD[(String, String)], but not on an RDD of Tuple3.
Also, once you groupBy, you could in principle follow up with reduceByKey, but since the keys are already unique at that point it makes no sense to call reduceByKey; instead we use map to transform each group one-to-one.
So, assuming df
is your main RDD, this piece of code:
val rest = df.groupBy(x => x._1).map(x => {
  val key = x._1         // ex: First
  val groupedData = x._2 // ex: [(First, Apple, 1), (First, Flower, 1)]

  // ex: [(First, Apple, 1), (First, Flower, 1)] => [Apple, Flower] => Apple,Flower
  val concat = groupedData.map(d => d._2).mkString(",")

  // ex: [(First, Apple, 1), (First, Flower, 1)] => [1, 1] => 2
  val sum = groupedData.map(d => d._3).sum

  (key, concat, sum) // return a Tuple3 again, same format
})
Returns this result:
(Second,Banana,1)
(First,Apple,Flower,2)
(Third,Tree,1)
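For completeness, the same result can also be reached with reduceByKey itself, by first re-keying the Tuple3 into the (key, value) shape that reduceByKey requires. A sketch of the idea is below; plain Scala collections stand in for the RDD so the combine function can be demonstrated without a SparkContext (the Spark equivalent is shown in the comments).

```scala
// In Spark this would be:
//   rdd.map { case (k, s, n) => (k, (s, n)) }
//      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + "," + s2, n1 + n2) }
// Below, a plain List stands in for the RDD.
val rows = List(
  ("First", "Apple", 1),
  ("Second", "Banana", 1),
  ("First", "Flower", 1),
  ("Third", "Tree", 1)
)

val combined: Map[String, (String, Int)] = rows
  .map { case (k, s, n) => (k, (s, n)) } // re-key into (K, V) pairs
  .groupBy(_._1)                         // stand-in for the shuffle by key
  .map { case (k, vs) =>
    k -> vs.map(_._2).reduce { case ((s1, n1), (s2, n2)) =>
      (s1 + "," + s2, n1 + n2)           // concatenate names, sum counts
    }
  }

// combined("First") == ("Apple,Flower", 2)
```

One caveat: in a real Spark job the concatenation order of the strings is not guaranteed, because reduceByKey combines values partition by partition before merging across partitions.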
EDIT
Implementing this with reduceByKey
, without the sum: if your dataset looks like this:
import org.apache.spark.rdd.RDD

val data = List(
  ("First", "Apple"),
  ("Second", "Banana"),
  ("First", "Flower"),
  ("Third", "Tree")
)

val df: RDD[(String, String)] = sparkSession.sparkContext.parallelize(data)
Then, this:
df.reduceByKey((acc, next) => acc + "," + next)
Gives this:
(First,Apple,Flower)
(Second,Banana)
(Third,Tree)
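If you also need the count back from this two-column dataset, a common pattern is to attach a count of 1 to each value first (in Spark, mapValues), then combine both fields in a single reduceByKey. A sketch, again with a plain List standing in for the RDD:

```scala
// In Spark this would be:
//   df.mapValues(v => (v, 1))
//     .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + "," + s2, n1 + n2) }
val pairs = List(
  ("First", "Apple"),
  ("Second", "Banana"),
  ("First", "Flower"),
  ("Third", "Tree")
)

val withCounts: Map[String, (String, Int)] = pairs
  .map { case (k, v) => (k, (v, 1)) }    // attach a count of 1 to each value
  .groupBy(_._1)
  .map { case (k, vs) =>
    k -> vs.map(_._2).reduce { case ((s1, n1), (s2, n2)) =>
      (s1 + "," + s2, n1 + n2)
    }
  }

// withCounts("First") == ("Apple,Flower", 2)
```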
Good luck!