I am trying to sum all the elements of an RDD and then divide that sum by the number of elements. I was able to solve it, but only across several lines. I would like to do it in a single line using RDD operations.
The RDD is, for example:
rdd_example = sc.parallelize([("eliana", 1), ("peter", 2), ("andrew", 3), ("paul", 4), ("jhon", 5)])
The first step is to extract just the numbers, using map with a lambda:
numbers = rdd_example.map(lambda x: x[1])
The output is:
numbers = [1,2,3,4,5]
Then I sum all the elements, using reduce:
from operator import add
sum = numbers.reduce(add)
Then I count the elements into another variable, using count:
number_elem = rdd_example.count()
Finally, a division gives the result:
result = sum/number_elem
I would like to do all of it using just a single line, with a single variable.
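For reference, the same multi-line chain can be mimicked on a plain Python list (no Spark required), which makes it easy to check the expected result:

```python
from functools import reduce
from operator import add

rdd_example = [("eliana", 1), ("peter", 2), ("andrew", 3), ("paul", 4), ("jhon", 5)]

numbers = [x[1] for x in rdd_example]  # stands in for map(lambda x: x[1])
total = reduce(add, numbers)           # stands in for reduce(add) -> 15
number_elem = len(rdd_example)         # stands in for count()     -> 5
result = total / number_elem
print(result)                          # 3.0
```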
CodePudding user response:
Use aggregate
with which you can compute both the count and the sum in one go. (fold won't work here: its single function is also used to merge two partial results, so the accumulator would have to have the same shape as the elements, and a (count, sum) pair does not match the (name, number) records.)
cnt, total = rdd_example.aggregate(
    (0, 0),                                      # initial (count, sum)
    lambda res, x: (res[0] + 1, res[1] + x[1]),  # add one record: bump count, add its value to sum
    lambda a, b: (a[0] + b[0], a[1] + b[1]),     # merge (count, sum) pairs from two partitions
)
print(total / cnt)
# 3.0
Notice that the call uses a tuple to carry the count and sum together: the second argument folds each record into a partition's running pair, and the third argument merges the pairs produced by different partitions.
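The count-and-sum accumulation can be sanity-checked without Spark; a sketch that emulates two partitions, with functools.reduce standing in for Spark's per-partition pass:

```python
from functools import reduce

data = [("eliana", 1), ("peter", 2), ("andrew", 3), ("paul", 4), ("jhon", 5)]

# per-element step: fold one (name, number) record into a running (count, sum) pair
seq_op = lambda res, x: (res[0] + 1, res[1] + x[1])
# merge step: combine the (count, sum) pairs produced by two partitions
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

# emulate two partitions to check that the merge step is correct
part1 = reduce(seq_op, data[:2], (0, 0))  # (2, 3)
part2 = reduce(seq_op, data[2:], (0, 0))  # (3, 12)
cnt, total = comb_op(part1, part2)        # (5, 15)
print(total / cnt)                        # 3.0
```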
CodePudding user response:
For a single-line solution, observe that you are computing the mean (average) of the numbers. PySpark RDDs already have a mean() method:
rdd_example = sc.parallelize([("eliana",1),("peter",2),("andrew",3),("paul",4),("jhon",5)])
result = rdd_example.map(lambda x: x[1]).mean()
print(result)
# output: 3.0
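For comparison, the same one-liner outside Spark, using the standard library's statistics.mean on a plain list:

```python
from statistics import mean

rdd_example = [("eliana", 1), ("peter", 2), ("andrew", 3), ("paul", 4), ("jhon", 5)]
result = mean(x[1] for x in rdd_example)
print(result)  # 3
```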