I am new to Spark and am trying to count the frequency of each letter in a list of names and then rank the top 10 letters. I am having trouble at the end when building the result tuples. Can anyone please help?
rdd_1 = sc.parallelize(['Scott', 'Steven', 'Sara', 'Mike', 'Mary', 'Joe', 'Jake'])
letters = rdd_1.flatMap(lambda x: list(x.lower()))
letters.collect()
Output for letters is:
['s', 'c', 'o', 't', 't', 's', 't', 'e', 'v', 'e', 'n', 's', 'a', 'r', 'a', 'm', 'i', 'k', 'e', 'm', 'a', 'r', 'y', 'j', 'o', 'e', 'j', 'a', 'k', 'e']
instances1 = letters.map(lambda letr: (letr, 1))
aggCounts1 = instances1.reduceByKey(lambda x, y: x + y)
aggCounts1.collect()
Output for aggCounts1.collect() is:
[('s', 3), ('r', 2), ('i', 1), ('y', 1), ('e', 5), ('a', 4), ('m', 2), ('j', 2), ('t', 3), ('n', 1), ('k', 2), ('c', 1), ('o', 2), ('v', 1)]
I want to find the top 10 letters and then rank them:
topWords = aggCounts1.top(10, lambda x: x[1])
topWords[:3]
Top 3 letters: [('e', 5), ('a', 4), ('s', 3)]
topTen = sc.parallelize(range(10))
This is what I came up with for building the result tuples:
# this is incorrect syntax
result = topTen.map (lambda ltrs,nums: ltrs for ltrs in topWords and nums in topTen (topWords[0], topTen) )
I am trying to get something like this:
[('e', 0), ('a', 1), ('s', 2), ('t', 3), ('r', 4), ('m', 5), ('j', 6), ('k', 7), ('o', 8), ('i', 9)]
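(Side note: since top() returns a plain Python list on the driver rather than an RDD, a pairing of exactly this shape can be built with enumerate and no second RDD at all. A minimal sketch, assuming topWords is the list from above:)

# topWords is a local Python list, not an RDD, so this runs on the driver;
# tie order among equal counts follows whatever order top() returned
result = [(letr, rank) for rank, (letr, count) in enumerate(topWords)]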
CodePudding user response:
You can use zipWithIndex as the last step and then map accordingly. See https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.zipWithIndex.html. It is a narrow transformation here.
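As a minimal illustration of what zipWithIndex produces, on a small example RDD:

sc.parallelize(['e', 'a', 's']).zipWithIndex().collect()
# [('e', 0), ('a', 1), ('s', 2)]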
UPDATE
Here it is done the right way, since you want a ranked list.
Full code:
%python
rdd_1 = sc.parallelize(['Scott', 'Steven', 'Sara', 'Mike', 'Mary', 'Joe', 'Jake'])
letters = rdd_1.flatMap(lambda x: list(x.lower()))
letters.collect()
instances1 = letters.map(lambda letr: (letr, 1))
aggCounts1 = instances1.reduceByKey(lambda x, y: x + y)
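# sort by count descending (ties broken alphabetically by letter), attach an
# index as the rank, then keep (letter, rank)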
topWords2 = aggCounts1.sortBy(lambda x: (-x[1], x[0])).zipWithIndex().map(lambda x: (x[0][0],x[1]))
topWords2.take(20)
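With the sample names above, this should give the following (ties at the same count come out alphabetically because of the x[0] in the sort key):

[('e', 0), ('a', 1), ('s', 2), ('t', 3), ('j', 4), ('k', 5), ('m', 6), ('o', 7), ('r', 8), ('c', 9), ('i', 10), ('n', 11), ('v', 12), ('y', 13)]

Use take(10) instead of take(20) if you only want the top 10.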