I'm having trouble finding the right way to join two RDDs in PySpark to achieve a desired result.
Here is the first RDD:
+------+---+
|    _1| _2|
+------+---+
|Python| 36|
|     C|  6|
|    C#|  8|
+------+---+
Here is the second RDD:
+------+---+
|    _1| _2|
+------+---+
|Python| 10|
|     C|  1|
|    C#|  1|
+------+---+
Here is the result I want:
+------+---+---+
|    _1| _2| _3|
+------+---+---+
|Python| 36| 10|
|     C|  6|  1|
|    C#|  8|  1|
+------+---+---+
I've tried all sorts of .join() and .union() variations between the two RDDs but can't get it right. Any help would be greatly appreciated!
CodePudding user response:
With RDDs
rdd1 = sc.parallelize([('python', 36), ('c', 6), ('c#', 8)])
rdd2 = sc.parallelize([('python', 10), ('c', 1), ('c#', 1)])
# join matches by key, giving (key, (v1, v2)); the map flattens that to (key, v1, v2)
rdd1.join(rdd2).map(lambda x: (x[0], *x[1])).toDF().show()
+------+---+---+
|    _1| _2| _3|
+------+---+---+
|python| 36| 10|
|     c|  6|  1|
|    c#|  8|  1|
+------+---+---+
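If it helps to see what the pair-RDD join is doing under the hood, here is a plain-Python sketch of the same key-matching logic (no Spark needed; the list names mirror the RDDs above):

```python
# Plain-Python sketch of a pair join: match elements of the two
# collections by their first field (the key), as RDD.join does.
rdd1 = [('python', 36), ('c', 6), ('c#', 8)]
rdd2 = [('python', 10), ('c', 1), ('c#', 1)]

lookup = dict(rdd2)  # key -> value from the second collection
# RDD.join yields (key, (v1, v2)); flattening gives (key, v1, v2)
joined = [(k, v1, lookup[k]) for k, v1 in rdd1 if k in lookup]
print(joined)  # [('python', 36, 10), ('c', 6, 1), ('c#', 8, 1)]
```

Note that in real Spark the output order of a join is not guaranteed, which is one reason joining by key beats trying to line rows up with .union().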
With DataFrames
df1 = rdd1.toDF(['c1', 'c2'])
df2 = rdd2.toDF(['c1', 'c3'])
df3 = df1.join(df2, on=['c1'], how='inner')
df3.show()
rdd3 = df3.rdd  # convert back to an RDD if you need one