I have an RDD[(String, String, Int)] and want to use reduceByKey to get the result shown below. I don't want to convert it to a DataFrame and then perform a groupBy operation to get the result. The Int is a constant, always 1.
Is it possible to use reduceByKey here to get the result? I'm presenting it in tabular format for easy reading.
Question
String | String | Int |
---|---|---|
First | Apple | 1 |
Second | Banana | 1 |
First | Flower | 1 |
Third | Tree | 1 |
Result
String | String | Int |
---|---|---|
First | Apple,Flower | 2 |
Second | Banana | 1 |
Third | Tree | 1 |
CodePudding user response:
You cannot use reduceByKey directly here: it is only defined on pair RDDs, i.e. RDD[(K, V)], so it would work if you had an RDD[(String, String)], but not on an RDD of Tuple3.
Also, once you groupBy, you could in principle follow up with reduceByKey, but since the keys are already unique at that point it makes no sense to call reduceByKey; instead we use map to transform each group one-to-one.
So, assuming df
is your main RDD, this piece of code:
val rest = df.groupBy(x => x._1).map(x => {
  val key = x._1         // ex: First
  val groupedData = x._2 // ex: [(First, Apple, 1), (First, Flower, 1)]

  // ex: [(First, Apple, 1), (First, Flower, 1)] => [Apple, Flower] => Apple,Flower
  val concat = groupedData.map(d => d._2).mkString(",")

  // ex: [(First, Apple, 1), (First, Flower, 1)] => [1, 1] => 2
  val sum = groupedData.map(d => d._3).sum

  (key, concat, sum) // return a Tuple3 again, same format
})
Returns this result:
(Second,Banana,1)
(First,Apple,Flower,2)
(Third,Tree,1)
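For completeness, the same result can also be reached with reduceByKey itself, by first re-keying the Tuple3 into the (key, value) shape that reduceByKey requires. A sketch of the idea is below; plain Scala collections stand in for the RDD so the combine function can be demonstrated without a SparkContext (the Spark equivalent is shown in the comments).

```scala
// In Spark this would be:
//   rdd.map { case (k, s, n) => (k, (s, n)) }
//      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + "," + s2, n1 + n2) }
// Below, a plain List stands in for the RDD.
val rows = List(
  ("First", "Apple", 1),
  ("Second", "Banana", 1),
  ("First", "Flower", 1),
  ("Third", "Tree", 1)
)

val combined: Map[String, (String, Int)] = rows
  .map { case (k, s, n) => (k, (s, n)) } // re-key into (K, V) pairs
  .groupBy(_._1)                         // stand-in for the shuffle by key
  .map { case (k, vs) =>
    k -> vs.map(_._2).reduce { case ((s1, n1), (s2, n2)) =>
      (s1 + "," + s2, n1 + n2)           // concatenate names, sum counts
    }
  }

// combined("First") == ("Apple,Flower", 2)
```

One caveat: in a real Spark job the concatenation order of the strings is not guaranteed, because reduceByKey combines values partition by partition before merging across partitions.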
EDIT
Implementing this with reduceByKey
, without the sum: if your dataset looks like this:
import org.apache.spark.rdd.RDD

val data = List(
  ("First", "Apple"),
  ("Second", "Banana"),
  ("First", "Flower"),
  ("Third", "Tree")
)

val df: RDD[(String, String)] = sparkSession.sparkContext.parallelize(data)
Then, this:
df.reduceByKey((acc, next) => acc + "," + next)
Gives this:
(First,Apple,Flower)
(Second,Banana)
(Third,Tree)
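If you also need the count back from this two-column dataset, a common pattern is to attach a count of 1 to each value first (in Spark, mapValues), then combine both fields in a single reduceByKey. A sketch, again with a plain List standing in for the RDD:

```scala
// In Spark this would be:
//   df.mapValues(v => (v, 1))
//     .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + "," + s2, n1 + n2) }
val pairs = List(
  ("First", "Apple"),
  ("Second", "Banana"),
  ("First", "Flower"),
  ("Third", "Tree")
)

val withCounts: Map[String, (String, Int)] = pairs
  .map { case (k, v) => (k, (v, 1)) }    // attach a count of 1 to each value
  .groupBy(_._1)
  .map { case (k, vs) =>
    k -> vs.map(_._2).reduce { case ((s1, n1), (s2, n2)) =>
      (s1 + "," + s2, n1 + n2)
    }
  }

// withCounts("First") == ("Apple,Flower", 2)
```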
Good luck!