PySpark RDD: Manipulating Inner Array


I have a dataset, for example:

from pyspark import SparkContext

sc = SparkContext()
x = [(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])]
y = sc.parallelize(x)
print(y.take(1))

The print statement returns [(1, [2, 3, 4, 5])]

I now need to multiply every element of the inner list by 2 across the whole RDD. Since the data is already parallelized, I can't simply call y.take(1) and then multiply [2, 3, 4, 5] by 2 on the driver.

How can I essentially isolate the inner array on my worker nodes and then do the multiplication there?

CodePudding user response:

I think you can use map with a lambda function:

# Keep the key and multiply every element of the inner list by 2 on the workers
y = sc.parallelize(x).map(lambda pair: (pair[0], [2 * t for t in pair[1]]))

Then y.take(2) returns:

[(1, [4, 6, 8, 10]), (2, [4, 14, 16, 20])]
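
Since the RDD holds (key, value) pairs, mapValues is an equivalent option that leaves the key untouched and only transforms the list. A minimal sketch, assuming the same x as above:

# Alternative sketch: mapValues applies the function to the value of each
# (key, value) pair; the keys are passed through unchanged.
y = sc.parallelize(x).mapValues(lambda vals: [2 * v for v in vals])

print(y.take(2))
# [(1, [4, 6, 8, 10]), (2, [4, 14, 16, 20])]

In both versions the multiplication runs on the worker nodes when an action such as take or collect is triggered; nothing needs to be pulled back to the driver first.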