Suppose you have two vectors of the same size that are stored as rdd1 and rdd2. Please write a function where the inputs are rdd1 and rdd2, and the output is an RDD which is the element-wise addition of rdd1 and rdd2. You should not load all the data to the driver program.
Hint: You may use zip() in Spark, not the zip() in Python.
I do not understand what is wrong with the code below, or whether it is correct. When I run it, it takes forever. Would you be able to help me with this? Thanks.
spark = SparkSession(sc)
numPartitions = 10
rdd1 = sc.textFile('./dataSet/points.txt',numPartitions).map(lambda x: int(x.split()[0]))
rdd2 = sc.textFile('./dataSet/points.txt',numPartitions).map(lambda x: int(x.split()[1]))
def ele_wise_add(rdd1, rdd2):
    rdd3 = rdd1.zip(rdd2).map(lambda x, y: x + y)
    return rdd3
rdd3 = ele_wise_add(rdd1, rdd2)
print(rdd3.collect())
rdd1 and rdd2 have 10000 numbers each; below are the first 10 numbers from each.
rdd1 = [47461, 93033, 92255, 33825, 90755, 3444, 48463, 37106, 5105, 68057]
rdd2 = [30614, 61104, 92322, 330, 94353, 26509, 36923, 64214, 69852, 63315]
expected output = [78075, 154137, 184577, 34155, 185108, 29953, 85386, 101320, 74957, 131372]
CodePudding user response:
You do not need .map(lambda x, y: x + y), I suspect. It ran fine for me with that change.
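To spell that out with a minimal sketch (the variable names here are mine, not from the original code): dropping the two-argument map leaves an RDD of (x1, x2) pairs, and the sums then come from a single-argument lambda.
pairs = rdd1.zip(rdd2)                   # each record is one (x1, x2) tuple, e.g. (47461, 30614)
sums = pairs.map(lambda p: p[0] + p[1])  # add inside the tuple; nothing is pulled to the driver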
CodePudding user response:
rdd1.zip(rdd2) would create a single tuple for each pair, so when writing the lambda function you only have x and not y. So you'd want sum(x) or x[0] + x[1], not x + y.
rdd1 = spark.sparkContext.parallelize((47461, 93033, 92255, 33825, 90755, 3444, 48463, 37106, 5105, 68057))
rdd2 = spark.sparkContext.parallelize((30614, 61104, 92322, 330, 94353, 26509, 36923, 64214, 69852, 63315))
rdd1.zip(rdd2).map(lambda x: sum(x)).collect()
[78075, 154137, 184577, 34155, 185108, 29953, 85386, 101320, 74957, 131372]
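One caveat, hedged because it depends on how the data is read: zip() assumes both RDDs have the same number of partitions and the same number of elements in each partition. Since your rdd1 and rdd2 come from two separate textFile() reads, that may not always hold; a sketch of a more defensive alternative using zipWithIndex() and join() (the variable names below are mine) would be:
# Key every value by its position in the RDD, giving (index, value) pairs.
indexed1 = rdd1.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
indexed2 = rdd2.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
# join matches on position, producing (index, (x1, x2)); sum each pair and restore the order.
summed = indexed1.join(indexed2).mapValues(lambda v: v[0] + v[1])
result = summed.sortByKey().values()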