I am trying to sum all the elements of an RDD and then divide that sum by the number of elements. I was able to solve it, but only across several lines. I would like to do it in a single line using RDD operations.
The RDD is, for example:
rdd_example = sc.parallelize([("eliana", 1), ("peter", 2), ("andrew", 3), ("paul", 4), ("jhon", 5)])
The first step is to extract just the numbers, using map with a lambda:
numbers = rdd_example.map(lambda x: x[1])
The output is:
numbers = [1,2,3,4,5]
Then I sum all the elements, using reduce:
from operator import add
sum = numbers.reduce(add)
Then I count the elements into another variable, using count:
number_elem = rdd_example.count()
Finally, a division gives the result:
result = sum/number_elem
I would like to do all of it using just a single line, with a single variable.
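For reference, the same multi-line chain can be mimicked on a plain Python list (no Spark required), which makes it easy to check the expected result:

```python
from functools import reduce
from operator import add

rdd_example = [("eliana", 1), ("peter", 2), ("andrew", 3), ("paul", 4), ("jhon", 5)]

numbers = [x[1] for x in rdd_example]  # stands in for map(lambda x: x[1])
total = reduce(add, numbers)           # stands in for reduce(add) -> 15
number_elem = len(rdd_example)         # stands in for count()     -> 5
result = total / number_elem
print(result)                          # 3.0
```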
CodePudding user response:
Use aggregate
with which you can compute both the count and the sum in one go. (fold won't work here: its single function is also used to merge two partial results, so the accumulator would have to have the same shape as the elements, and a (count, sum) pair does not match the (name, number) records.)
cnt, total = rdd_example.aggregate(
    (0, 0),                                      # initial (count, sum)
    lambda res, x: (res[0] + 1, res[1] + x[1]),  # add one record: bump count, add its value to sum
    lambda a, b: (a[0] + b[0], a[1] + b[1]),     # merge (count, sum) pairs from two partitions
)
print(total / cnt)
# 3.0
Notice that the call uses a tuple to carry the count and sum together: the second argument folds each record into a partition's running pair, and the third argument merges the pairs produced by different partitions.
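The count-and-sum accumulation can be sanity-checked without Spark; a sketch that emulates two partitions, with functools.reduce standing in for Spark's per-partition pass:

```python
from functools import reduce

data = [("eliana", 1), ("peter", 2), ("andrew", 3), ("paul", 4), ("jhon", 5)]

# per-element step: fold one (name, number) record into a running (count, sum) pair
seq_op = lambda res, x: (res[0] + 1, res[1] + x[1])
# merge step: combine the (count, sum) pairs produced by two partitions
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

# emulate two partitions to check that the merge step is correct
part1 = reduce(seq_op, data[:2], (0, 0))  # (2, 3)
part2 = reduce(seq_op, data[2:], (0, 0))  # (3, 12)
cnt, total = comb_op(part1, part2)        # (5, 15)
print(total / cnt)                        # 3.0
```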
CodePudding user response:
For a single-line solution, observe that you are computing the mean (average) of the numbers. PySpark RDDs already have a mean() method:
rdd_example = sc.parallelize([("eliana",1),("peter",2),("andrew",3),("paul",4),("jhon",5)])
result = rdd_example.map(lambda x: x[1]).mean()
print(result)
# output: 3.0
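For comparison, the same one-liner outside Spark, using the standard library's statistics.mean on a plain list:

```python
from statistics import mean

rdd_example = [("eliana", 1), ("peter", 2), ("andrew", 3), ("paul", 4), ("jhon", 5)]
result = mean(x[1] for x in rdd_example)
print(result)  # 3
```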