def naive_word_counting(wordsList):
    # Count occurrences of each word with a plain Python dict.
    words = {}
    for word in wordsList:
        if word in words:
            words[word] += 1
        else:
            words[word] = 1
    return sorted(words.items(), key=lambda x: x[1], reverse=True)
def map_reduce_word_counting(wordsList):
    # Distribute the word list as an RDD (sparkEngine is an already-created SparkSession).
    sparkRDD = sparkEngine.sparkContext.parallelize(wordsList)

    def map_function(word):
        return (word, 1)

    def reduce_function(a, b):
        return a + b

    counts = sparkRDD.map(map_function).reduceByKey(reduce_function)
    return counts.sortBy(lambda x: x[1], ascending=False).collect()
On a list of 4,141,822 words, naive_word_counting runs in 0.61 s while map_reduce_word_counting runs in 3.01 s. I expected Spark to be faster than the plain Python version, but it is actually significantly slower.
Why is that? Is there something I haven't understood?
CodePudding user response:
How did you measure the execution time? Did you time only the word count itself? Remember that Spark needs some time to create the context before the main logic is executed.
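As a minimal sketch (assuming sparkEngine is a SparkSession created once, outside the timed region, and wordsList is the list from the question), a fairer measurement might look like this:

    import time
    from pyspark.sql import SparkSession

    # Create the session up front so its startup cost is not part of the timing.
    sparkEngine = SparkSession.builder.appName("word-count-benchmark").getOrCreate()

    start = time.perf_counter()
    spark_result = map_reduce_word_counting(wordsList)   # only the RDD job is timed
    print(f"Spark word count: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    naive_result = naive_word_counting(wordsList)
    print(f"Naive word count: {time.perf_counter() - start:.2f}s")

Even with the session created up front, Spark still adds per-job overhead (task scheduling, serializing the list out to executors), so a single-machine run on a few million words may remain slower than the in-memory dict.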
Please find another example of how to calculate a word count here: https://bigdata-etl.com/configuration-of-apache-spark-scala-and-intellij/
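That example is written in Scala; a rough PySpark equivalent of the same canonical word-count pattern (using a hypothetical input file input.txt) would be:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count").getOrCreate()

    # Read lines, split them into words, and aggregate counts with reduceByKey.
    counts = (spark.sparkContext.textFile("input.txt")   # hypothetical input path
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b)
              .sortBy(lambda pair: pair[1], ascending=False))

    print(counts.take(10))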