Rank letter counts in pyspark into RDD

I am new to Spark and am trying to count the frequency of each letter in a list of names, then rank the top 10 letters. I am having trouble at the end when building the tuples. Can anyone please help?

rdd_1 = sc.parallelize(['Scott', 'Steven', 'Sara', 'Mike', 'Mary', 'Joe', 'Jake'])

letters = rdd_1.flatMap (lambda x: list(x.lower()))

letters.collect()

output for letters.collect() is:

['s', 'c', 'o', 't', 't', 's', 't', 'e', 'v', 'e', 'n', 's', 'a', 'r', 'a', 'm', 'i', 'k', 'e', 'm', 'a', 'r', 'y', 'j', 'o', 'e', 'j', 'a', 'k', 'e']

instances1 = letters.map (lambda letr: (letr, 1))
aggCounts1 = instances1.reduceByKey (lambda x, y: x + y)
aggCounts1.collect()

output for aggCounts1.collect() is:

[('s', 3), ('r', 2), ('i', 1), ('y', 1), ('e', 5), ('a', 4), ('m', 2), ('j', 2), ('t', 3), ('n', 1), ('k', 2), ('c', 1), ('o', 2), ('v', 1)]

I want to find the top 10 letters and then rank them:

topWords = aggCounts1.top (10, lambda x : x[1])
topWords[:3]

top 3 letters: [('e', 5), ('a', 4), ('s', 3)]

topTen = sc.parallelize(range(10))

This is what I made for the tuple result:

# this is incorrect syntax
result = topTen.map (lambda ltrs,nums: ltrs for ltrs in topWords and nums in topTen (topWords[0], topTen) )

I am trying to get something like this:

[('e', 0), ('a', 1), ('s', 2), ('t', 3), ('r', 4), ('m', 5), ('j', 6), ('k', 7), ('o', 8), ('i', 9)]

CodePudding user response：

You can use zipWithIndex as last step and then map accordingly.

See https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.zipWithIndex.html.

zipWithIndex does not shuffle any data, though it does trigger a Spark job when the RDD has more than one partition.
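As a rough local illustration of the idea: top() returns a plain Python list rather than an RDD, so pairing each letter with its position can also be done with enumerate, which plays the same role for a list that zipWithIndex plays for an RDD. The list below is just the three entries printed in the question, copied in by hand, and the names top_words and ranked are made up for this sketch.

```python
# Hand-copied from the question's topWords[:3] output (not computed here).
top_words = [('e', 5), ('a', 4), ('s', 3)]

# enumerate yields (position, element); keep the letter and its position,
# dropping the count, to get (letter, rank) tuples.
ranked = [(letter, rank) for rank, (letter, _count) in enumerate(top_words)]

print(ranked)  # [('e', 0), ('a', 1), ('s', 2)]
```

The order of letters with equal counts is whatever top() happened to return, so ties are not broken in any particular way here.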

UPDATE

Here it is done on the RDD itself, since top() gives you a plain Python list rather than an RDD.

Full code:

%python

rdd_1 = sc.parallelize(['Scott', 'Steven', 'Sara', 'Mike', 'Mary', 'Joe', 'Jake'])
letters = rdd_1.flatMap (lambda x: list(x.lower()))
letters.collect()

instances1 = letters.map (lambda letr: (letr, 1))
aggCounts1 = instances1.reduceByKey (lambda x, y: x + y)

# sort by count descending (ties alphabetical), attach the position, keep (letter, rank)
topWords2 = aggCounts1.sortBy(lambda x: (-x[1], x[0])).zipWithIndex().map(lambda x: (x[0][0],x[1]))
# use take(10) if you only want the top 10
topWords2.take(20)
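If you want to sanity-check the ranking without a Spark session, the same count, sort, index steps can be sketched in plain Python. This is only a stand-in, not Spark code: collections.Counter takes the place of map plus reduceByKey, sorted takes the place of sortBy, and enumerate takes the place of zipWithIndex. The variable names are invented for the sketch.

```python
from collections import Counter

names = ['Scott', 'Steven', 'Sara', 'Mike', 'Mary', 'Joe', 'Jake']

# Stand-in for flatMap + map + reduceByKey: count every lower-cased letter.
counts = Counter(letter for name in names for letter in name.lower())

# Stand-in for sortBy(lambda x: (-x[1], x[0])):
# highest count first, ties broken alphabetically.
ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Stand-in for zipWithIndex + map: keep (letter, position).
ranked = [(letter, rank) for rank, (letter, _count) in enumerate(ordered)]

print(ranked[:10])
# [('e', 0), ('a', 1), ('s', 2), ('t', 3), ('j', 4),
#  ('k', 5), ('m', 6), ('o', 7), ('r', 8), ('c', 9)]
```

Note that letters with equal counts come out alphabetically because of the sort key, so the order among ties differs from the example in the question.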