I need to find all the longest word/words in a txt file using MapReduce. I have written the following code for the mapper and reducer, but it prints the entire dictionary, with len(word) as the key and the words as values. I need help writing the code so that it shows only the maximum length and the corresponding words. Following is my code:
"""mapper.py"""
import sys
> for line in sys.stdin:
> for word in line.strip().split():
> print ('%s\t%s' % (len(word), word))
"""reducer.py"""
> import sys results={} for line in sys.stdin:
> index, value = line.strip().split('\t')
> if index not in results :
> results[index] = value
> else :
> results[index] = ' '
> results[index] = value
I'm just stuck on this part: how to continue the code so it reports the max key with the corresponding words.
Input file: How Peace Begins ? Peace begins with saying sorry, Peace begins with not hurting others, Peace begins with honesty ,trust and dedications, Peace begins with showing cooperation and respect. World Peace Begins with Me !
Output expected: The longest word has 11 characters. The words are: dedications cooperation
CodePudding user response:
I am not sure what you are doing with the stdin or why you are importing sys. Also, the sample input file doesn't seem to be in CSV format but is just a simple text file. As I understand your problem, you want to read an input file, measure the length of each word, and report the maximum word length along with the words that meet that criterion. With this in mind, this is how I would proceed:
inputFile = r'sampleMapperText.txt'
with open(inputFile, 'r') as f:
    reslt = dict()  # keys = word lengths, values = words of key length
    text = f.read().split('\n')
    for line in text:
        words = line.split()
        for w in words:
            wdlist = reslt.pop(len(w), [])
            wdlist.append(w)
            reslt[len(w)] = wdlist
maxLen = max(list(reslt.keys()))
print(f"Max Word Length = {maxLen}, Longest words = {', '.join(reslt[maxLen])}")
Running this code produces:
Max Word Length = 12, Longest words = dedications,
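Note that the trailing comma is counted as part of the word here, which is why this reports 12 rather than the 11 in your expected output. If punctuation should not count, one option (a small sketch, assuming only ASCII punctuation matters for this input) is to strip it with str.strip and string.punctuation before taking len(w):
import string

# strip leading/trailing punctuation before measuring word length
word = 'dedications,'
clean = word.strip(string.punctuation)
print(len(word), len(clean))   # 12 11
Applying the same w.strip(string.punctuation) to each w inside the loop above (and skipping tokens that become empty) should make the output match the expected 11-character result, dedications and cooperation.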
If you insist on splitting the process into two separate files, and assuming the two files are in the same directory, I would do it as follows:
The contents of the reducer.py file would be:
# reducer.py
def getData(filepath: str) -> list[str]:
    with open(filepath, 'r') as f:
        text = f.read().split('\n')
    return text
The contents of the mapper.py file would be:
# mapper.py
from reducer import getData

def mapData(text: list[str]):
    reslt = dict()  # keys = word lengths, values = words of key length
    for line in text:
        words = line.split()
        for w in words:
            wdlist = reslt.pop(len(w), [])
            wdlist.append(w)
            reslt[len(w)] = wdlist
    maxLen = max(list(reslt.keys()))
    print(f"Max Word Length = {maxLen}, Longest words = {', '.join(reslt[maxLen])}")

inputFile = r'sampleMapperText.txt'
mapData(getData(inputFile))
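If you do want to keep the original Hadoop-streaming shape (a mapper writing len\tword lines to stdout and a reducer reading them from stdin), here is a minimal sketch of a reducer that keeps only the maximum length. It assumes the mapper from the question, so punctuation attached to a word still counts toward its length unless you strip it there:
"""reducer.py (streaming-style sketch)"""

import sys

results = {}  # key: word length (int), value: list of words of that length
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    index, value = line.split('\t')
    results.setdefault(int(index), []).append(value)

if results:
    max_len = max(results)       # longest length seen
    longest = results[max_len]
    print('The longest word has %d characters.' % max_len)
    print('The words are: %s' % ' '.join(longest))
You can test it locally with something like cat sampleMapperText.txt | python mapper.py | sort | python reducer.py, which mimics Hadoop streaming feeding the sorted mapper output into the reducer.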