I am trying to apply MapReduce principles in Python: first map each word of a text to its number of characters, then reduce to display the longest words and their number of characters.
The input is a simple text.txt file with 3 sentences. My mapper is below:
#!/usr/bin/python
# MAPPER: emit "<length><TAB><word>" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(str(len(word)) + '\t' + word)
Thanks for any help with the reducer. The required output is "The longest words in this text are xxxxxxxxx and yyyyyyyy with xx characters".
CodePudding user response:
This does what you ask. It keeps only the 5 longest words at any time, so the list being sorted always stays small. The tuples are sorted by length first and then by the word itself. The "print" in the loop is only for debugging and can be removed.
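import sys

top5 = []
for line in sys.stdin:
    for word in line.strip().split():
        # Store (length, word) so that sorting orders by length first
        top5.append( (len(word),word) )
        top5.sort( reverse=True )
        # Trim to the 5 longest entries seen so far
        top5 = top5[:5]
        print(top5)  # debugging only, can be removed

print( "Top 5 words by length are:" )
print(top5)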
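If the reducer should instead consume the mapper's tab-separated output and print the exact sentence from the question, a minimal sketch could look like the following. It assumes each input line has the form "length<TAB>word", as the mapper prints it. The function names `longest_words` and `format_result` are made up for illustration, and the "No words were found." fallback message is an assumption, not something from the question.

```python
#!/usr/bin/env python3
# REDUCER (sketch): reads "length<TAB>word" lines, as produced by the mapper.
import sys


def longest_words(lines):
    """Return (max_length, words) for the longest words found in `lines`."""
    max_len = 0
    longest = []
    for line in lines:
        length_str, _, word = line.rstrip("\n").partition("\t")
        try:
            length = int(length_str)
        except ValueError:
            continue  # skip blank or malformed lines
        if not word:
            continue
        if length > max_len:
            # New maximum: start a fresh list of longest words
            max_len, longest = length, [word]
        elif length == max_len and word not in longest:
            # Tie with the current maximum: keep it, without duplicates
            longest.append(word)
    return max_len, longest


def format_result(max_len, words):
    """Build the sentence asked for in the question."""
    if not words:
        return "No words were found."  # assumed fallback message
    return "The longest words in this text are {} with {} characters".format(
        " and ".join(words), max_len
    )


if __name__ == "__main__":
    print(format_result(*longest_words(sys.stdin)))
```

Because it only tracks the running maximum, this version does not need its input sorted, so it can be tested locally by piping the mapper's output straight into it. Note that the mapper splits on whitespace only, so punctuation attached to a word counts toward its length.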