Imagine I have two RDDs I would like to combine element-wise:
data1 = [1,2,3]
rdd1 = spark.sparkContext.parallelize(data1)
data2 = [7,8,9]
rdd2 = spark.sparkContext.parallelize(data2)
What's the best way to multiply each element of rdd1 by every element of rdd2, so that I end up with the following result?
rdd3 = [[7,8,9], [14,16,18], [21,24,27]]
I have a feeling it's a join operation but I'm not sure how to set up the key value pairs.
CodePudding user response:
Try cartesian, something like this:
data1 = [1, 2, 3]
rdd1 = spark.sparkContext.parallelize(data1)
# Note the extra brackets: rdd2 holds a single element, the whole list [7, 8, 9],
# so cartesian pairs every number in rdd1 with that full list.
data2 = [[7, 8, 9]]
rdd2 = spark.sparkContext.parallelize(data2)
# Each pair x is (number, [7, 8, 9]); scale the whole list by the number.
rdd1.cartesian(rdd2).map(lambda x: [x[0] * i for i in x[1]]).collect()
# [[7, 8, 9], [14, 16, 18], [21, 24, 27]]
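If data2 is small, an alternative is to skip the second RDD entirely and broadcast the plain list instead (a minimal sketch, assuming the same spark session as above; b is just an illustrative name):
data2 = [7, 8, 9]
# broadcast ships the small list to every executor once
b = spark.sparkContext.broadcast(data2)
rdd1.map(lambda a: [a * i for i in b.value]).collect()
# [[7, 8, 9], [14, 16, 18], [21, 24, 27]]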
CodePudding user response:
You can take the cartesian product of the RDDs and then reduce the pairs with reduceByKey to build each list.
Note: Spark is a distributed processing engine, and reduceByKey can return the final lists in any order. If you want strong ordering guarantees, enrich your RDDs with an index element, as in the sketch after the output below.
data1 = [1,2,3]
rdd1 = spark.sparkContext.parallelize(data1)
data2 = [7,8,9]
rdd2 = spark.sparkContext.parallelize(data2)
# Pair every element of rdd1 with every element of rdd2,
# key each product by its rdd1 element, then concatenate per key.
rdd1.cartesian(rdd2)\
    .map(lambda x: (x[0], [x[0] * x[1]]))\
    .reduceByKey(lambda x, y: x + y)\
    .map(lambda x: x[1]).collect()
Output
[[7, 8, 9], [14, 16, 18], [21, 24, 27]]
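A sketch of that index-based variant (assuming the same spark session; zipWithIndex, groupByKey, and sortByKey are standard RDD methods). Tagging both RDDs with positions makes the row order and the order within each row deterministic:
# Tag every element with its position, giving (index, value) pairs
idx1 = rdd1.zipWithIndex().map(lambda x: (x[1], x[0]))
idx2 = rdd2.zipWithIndex().map(lambda x: (x[1], x[0]))
rdd3 = (idx1.cartesian(idx2)
        # ((i, a), (j, b)) -> (i, (j, a * b)): row index i, column index j
        .map(lambda p: (p[0][0], (p[1][0], p[0][1] * p[1][1])))
        .groupByKey()
        # sort each row's products by column index j, keep only the products
        .mapValues(lambda vals: [v for _, v in sorted(vals)])
        .sortByKey()  # restore the original row order
        .values())
rdd3.collect()
# [[7, 8, 9], [14, 16, 18], [21, 24, 27]]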