How to join two RDD's in PySpark?

Time:04-11

I'm having trouble finding the right way to join two RDDs in PySpark to achieve the desired result.

Here is the first RDD:
+------+---+
|    _1| _2|
+------+---+
|Python| 36|
|     C|  6|
|    C#|  8|
+------+---+

Here is the second RDD:
+------+---+
|    _1| _2|
+------+---+
|Python| 10|
|     C|  1|
|    C#|  1|
+------+---+

Here is the result I want:
+------+---+---+
|    _1| _2| _3|
+------+---+---+
|Python| 36| 10|
|     C|  6|  1|
|    C#|  8|  1|
+------+---+---+

I've tried all sorts of .join() and .union() variations between the two RDDs but can't get it right. Any help would be greatly appreciated!

CodePudding user response:

With RDD:
rdd1 = sc.parallelize([('python', 36), ('c', 6), ('c#', 8)])
rdd2 = sc.parallelize([('python', 10), ('c', 1), ('c#', 1)])
# join() pairs values by key, producing (key, (v1, v2));
# the map() flattens each record to (key, v1, v2) before converting to a DataFrame
rdd1.join(rdd2).map(lambda x: (x[0], *x[1])).toDF().show()
+------+---+---+
|    _1| _2| _3|
+------+---+---+
|python| 36| 10|
|     c|  6|  1|
|    c#|  8|  1|
+------+---+---+
With DataFrames:
df1 = rdd1.toDF(['c1', 'c2'])
df2 = rdd2.toDF(['c1', 'c3'])
# inner join on the shared key column, then convert back to an RDD of Row objects
rdd3 = df1.join(df2, on=['c1'], how='inner').rdd
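For intuition, RDD.join is an inner join by key: each record on the left is paired with every record on the right that shares its key, yielding (key, (left_value, right_value)). A minimal plain-Python sketch of that semantics (no Spark required; the function name inner_join is just illustrative) looks like this:

```python
def inner_join(left, right):
    # Index the right-hand pairs by key, keeping all values per key
    right_by_key = {}
    for k, v in right:
        right_by_key.setdefault(k, []).append(v)
    # Emit (key, (left_value, right_value)) for every matching key
    return [(k, (v, w)) for k, v in left for w in right_by_key.get(k, [])]

left = [('python', 36), ('c', 6), ('c#', 8)]
right = [('python', 10), ('c', 1), ('c#', 1)]
joined = inner_join(left, right)
# Flatten (key, (v1, v2)) into (key, v1, v2), as the map() step above does
rows = [(k, *vs) for k, vs in joined]
```

Keys with no match on the other side are simply dropped, which is why an inner join is the right tool here: both RDDs contain exactly the same set of languages.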