I have two text files already created as RDDs by SparkContext.
One of them (rdd1) holds related words:
apple,apples
car,cars
computer,computers
The other (rdd2) holds the number of items:
(apple,12)
(apples,50)
(car,5)
(cars,40)
(computer,77)
(computers,11)
I want to combine those two RDDs.
Desired output:
(apple, 62)
(car,45)
(computer,88)
How can I code this?
CodePudding user response:
The meat of the work is picking a key for each group of related words. Here I just select the first word, but you could do something more intelligent than picking an arbitrary one (a sketch of a smarter key chooser follows the code below).
Explanation:
- Create the data.
- Pick a key for each group of related words.
- flatMap the tuples so we can join on that key.
- Join the two RDDs.
- Map the result back into a (key, count) tuple.
- Reduce by key to sum the counts.
val s = Seq(("apple","apples"), ("car","cars"), ("computer","computers")) // related-word pairs from the question
val rdd = sc.parallelize(s)
val t = Seq(("apple",12), ("apples",50), ("car",5), ("cars",40), ("computer",77), ("computers",11)) // item counts from the question
val rdd2 = sc.parallelize(t)
rdd.flatMap { case (a, b) => Seq((a, a), (b, a)) } // could be replaced with any function that selects the key to use for all of the related words
  .join(rdd2) // complete the join
  .map { case (_, (key, count)) => (key, count) } // recreate a tuple and throw away the related word
  .reduceByKey(_ + _) // sum the counts per key
  .foreach(println) // to show it works
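As noted above, the key chooser is pluggable. Here is a minimal sketch of one smarter choice, keying each group by its shortest word so the singular form usually wins; the group structure and the minBy rule are my own assumptions, not part of the original answer:

// Hypothetical variant: related-word groups of any size, keyed by the
// shortest word in the group (assumed to be the singular/canonical form).
val groups = Seq(Seq("apple","apples"), Seq("car","cars"), Seq("computer","computers"))
sc.parallelize(groups)
  .flatMap { group =>
    val key = group.minBy(_.length) // assumption: the shortest word is canonical
    group.map(word => (word, key))  // (word, canonical key) pairs, ready to join
  }
  .join(rdd2)
  .map { case (_, (key, count)) => (key, count) }
  .reduceByKey(_ + _)
  .foreach(println)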
Even though this solves your problem, there are more elegant solutions using DataFrames that you may wish to look into. You could also reduce directly on the RDD and skip the step of mapping back to a tuple; I think that would be a better solution, but I wanted to keep this simple so it is more illustrative of what I did.
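For completeness, a rough sketch of the DataFrame route mentioned above; the column names and the assumption of a spark-shell-style SparkSession named spark are mine, not part of the original answer:

import org.apache.spark.sql.functions.sum
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

val related = rdd.flatMap { case (a, b) => Seq((a, a), (b, a)) }.toDF("word", "key")
val counts  = rdd2.toDF("word", "count")

related.join(counts, "word")      // attach each word's count to its key
  .groupBy("key")                 // one group per canonical word
  .agg(sum("count").as("total"))  // sum the counts in each group
  .show()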