Suppose you have two vectors of the same size that are stored as rdd1 and rdd2. Please write a function where the inputs are rdd1 and rdd2, and the output is an RDD which is the element-wise addition of rdd1 and rdd2. You should not load all the data to the driver program.
Hint: You may use zip() in Spark, not the zip() in Python.
I do not understand what is wrong with the code below, or whether it is correct. When I run it, it takes forever. Would you be able to help me with this? Thanks.
spark = SparkSession(sc)
numPartitions = 10
rdd1 = sc.textFile('./dataSet/points.txt',numPartitions).map(lambda x: int(x.split()[0]))
rdd2 = sc.textFile('./dataSet/points.txt',numPartitions).map(lambda x: int(x.split()[1]))
def ele_wise_add(rdd1, rdd2):
    rdd3 = rdd1.zip(rdd2).map(lambda x, y: x + y)
    return rdd3
rdd3 = ele_wise_add(rdd1, rdd2)
print(rdd3.collect())
rdd1 and rdd2 have 10000 numbers each; below are the first 10 numbers from each.
rdd1 = [47461, 93033, 92255, 33825, 90755, 3444, 48463, 37106, 5105, 68057]
rdd2 = [30614, 61104, 92322, 330, 94353, 26509, 36923, 64214, 69852, 63315]
expected output = [78075, 154137, 184577, 34155, 185108, 29953, 85386, 101320, 74957, 131372]
CodePudding user response:
You do not need .map(lambda x, y: x + y), I suspect. It ran fine for me with that change.
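To spell that out with a minimal sketch (the variable names here are mine, not from the original code): dropping the two-argument map leaves an RDD of (x1, x2) pairs, and the sums then come from a single-argument lambda.
pairs = rdd1.zip(rdd2)                   # each record is one (x1, x2) tuple, e.g. (47461, 30614)
sums = pairs.map(lambda p: p[0] + p[1])  # add inside the tuple; nothing is pulled to the driver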
CodePudding user response:
rdd1.zip(rdd2) would create a single tuple for each pair, so when writing the lambda function you only have x and not y. So you'd want sum(x) or x[0] + x[1], not x + y.
rdd1 = spark.sparkContext.parallelize((47461, 93033, 92255, 33825, 90755, 3444, 48463, 37106, 5105, 68057))
rdd2 = spark.sparkContext.parallelize((30614, 61104, 92322, 330, 94353, 26509, 36923, 64214, 69852, 63315))
rdd1.zip(rdd2).map(lambda x: sum(x)).collect()
[78075, 154137, 184577, 34155, 185108, 29953, 85386, 101320, 74957, 131372]
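One caveat, hedged because it depends on how the data is read: zip() assumes both RDDs have the same number of partitions and the same number of elements in each partition. Since your rdd1 and rdd2 come from two separate textFile() reads, that may not always hold; a sketch of a more defensive alternative using zipWithIndex() and join() (the variable names below are mine) would be:
# Key every value by its position in the RDD, giving (index, value) pairs.
indexed1 = rdd1.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
indexed2 = rdd2.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
# join matches on position, producing (index, (x1, x2)); sum each pair and restore the order.
summed = indexed1.join(indexed2).mapValues(lambda v: v[0] + v[1])
result = summed.sortByKey().values()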