I'm new to pyspark, and have been trying to figure this out for hours.
Currently, my RDD looks like this:
[['74', '85', '123'], ['73', '84', '122'], ['72', '83', '121'], ['70', '81', '119'], ['70', '81', '119'], ['69', '80', '118'], ['70', '81', '119'], ['70', '81', '119'], ['76', '87', '125'], ['76', '87', '125']]
I want it to look like this (where all the entries are integers):
[[74, 85, 123], [73, 84, 122], [72, 83, 121], [70, 81, 119], [70, 81, 119], [69, 80, 118], [70, 81, 119], [70, 81, 119], [76, 87, 125], [76, 87, 125]]
The closest I've gotten was by using flatMap to turn it into a 1D array and then converting the entries into integers. However, I want to process the integers three at a time (calculating the sum and average of each group of three), and I figured keeping it as a 2D array would be the easiest way to do that. I also tried list comprehensions, but they don't seem to work since it isn't a list. Any help would be greatly appreciated!
CodePudding user response:
Update
Before performing the operations below, you can use map and collect to convert your RDD into a Python list.
# parallelize the original data, then bring it back to the driver as a list of lists of strings
rdd = spark.sparkContext.parallelize(data)
list_string = rdd.map(list).collect()
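If you would rather keep everything in Spark instead of collecting to the driver, a minimal sketch like this (the rdd_int and rdd_stats names are just illustrative) does the integer conversion and the per-row sum/average with map alone:
# convert each row of strings to integers while the data stays distributed
rdd_int = rdd.map(lambda row: [int(x) for x in row])
# compute (sum, average) for every row of three values
rdd_stats = rdd_int.map(lambda row: (sum(row), sum(row) / len(row)))
rdd_stats.collect()
[(282, 94.0), (279, 93.0), (276, 92.0), (270, 90.0), (270, 90.0), (267, 89.0), (270, 90.0), (270, 90.0), (288, 96.0), (288, 96.0)]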
A list comprehension is quick and efficient enough to convert all your strings to integers. Practise it a little and the pattern will quickly become second nature.
list_value = [[int(i) for i in list_] for list_ in list_string]
print(list_value)
[[74, 85, 123], [73, 84, 122], [72, 83, 121], [70, 81, 119], [70, 81, 119], [69, 80, 118], [70, 81, 119], [70, 81, 119], [76, 87, 125], [76, 87, 125]]
The same approach works for summing and averaging each row of the 2D list.
list_sum = [sum(vector) for vector in list_value]
list_avg = [sum(vector)/len(vector) for vector in list_value]
Or better, just use NumPy to do the trick.
import numpy as np

array = np.array(list_value)
np.sum(array, axis = 1)
Out[174]: array([282, 279, 276, 270, 270, 267, 270, 270, 288, 288])
np.average(array, axis=1)
Out[175]: array([94., 93., 92., 90., 90., 89., 90., 90., 96., 96.])
For a speed comparison, I've created a list and an array of shape (1000, 3). Hopefully this gives you a clearer picture of their relative efficiency.
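The exact benchmark data is my own assumption; a setup roughly like this produces a list and an array of that shape:
import numpy as np
# 1000 rows of 3 integers, mirroring the structure of the original data
array = np.random.randint(60, 130, size=(1000, 3))
list_value = array.tolist()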
%timeit np.sum(array, axis=1)
%timeit [sum(vector) for vector in list_value]
20.3 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
167 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.average(array, axis=1)
%timeit [sum(vector)/len(vector) for vector in list_value]
29.3 µs ± 536 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
256 µs ± 23.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For small 2D lists, however, the plain list comprehension is faster than going through a NumPy array.