Map Reduce to read a text file using Python


I'm trying to write a MapReduce program that can read an input file, but I don't really know how to read the file from within a MapReduce program in Python.

Can someone give me a code snippet? I have tried the following code to read a file in Python. I had already pushed the file to HDFS before reading it.

with open('/usr/total.txt', 'r') as f:
    g_total = int(f.readline())

CodePudding user response:

Just to give you a quick background on MapReduce - MapReduce is a processing technique and a programming model based on distributed computing. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Second, the Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job. source - https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm

Here's an example map reduce program in python for your reference - https://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
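To give a flavor of what that tutorial covers, here is a minimal sketch of the classic word-count pattern. In a real Hadoop Streaming job the mapper and reducer would be two separate scripts reading lines from stdin; here both phases are plain generator functions chained in one process so the flow is easy to follow (the in-process shuffle via `sorted` is a simplification of what Hadoop does between the two phases):

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word on every input line.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: Hadoop delivers pairs sorted by key, so consecutive
    # pairs with the same word can be grouped and their counts summed.
    # sorted() here stands in for Hadoop's shuffle-and-sort step.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    counts = dict(reducer(mapper(["big data", "big apple"])))
    print(counts)  # {'apple': 1, 'big': 2, 'data': 1}
```

The same two functions, rewritten to read `sys.stdin` and print tab-separated pairs, are exactly what you would pass to Hadoop Streaming as the mapper and reducer scripts.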

CodePudding user response:

calculate [...] percentage so for that i need total number of rows

Calculating totals or averages isn't really a good use case for MapReduce. You need to force all data to one reducer with a common key; then you can sum, count, and divide to get percentages. For example, the mapper would output <null,year>, then the reducer would first sum the counts per year (use a Counter object, for example), then sum all years together to get the total number of records, since all data is now available to that one reducer. Once you have that, divide each yearly count by the total to output <year,yearly_count/total_count>, giving a percentage for each year.
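The single-reducer approach above can be sketched like this, again with both phases as in-process functions rather than separate streaming scripts. The CSV layout (year in the first field) is an assumption for illustration, since the question doesn't show the input format:

```python
from collections import Counter

def mapper(lines):
    # Map phase: emit every record under one shared key (None) so the
    # shuffle routes all records to a single reducer.
    for line in lines:
        year = line.split(",")[0]  # assumption: the year is the first CSV field
        yield None, year

def reducer(pairs):
    # Reduce phase: the lone reducer sees every record, so it can count
    # per year, compute the grand total, and emit each year's share.
    yearly = Counter(year for _, year in pairs)
    total = sum(yearly.values())
    for year, count in sorted(yearly.items()):
        yield year, count / total

if __name__ == "__main__":
    rows = ["2019,a", "2019,b", "2020,c", "2020,d"]
    print(dict(reducer(mapper(rows))))  # {'2019': 0.5, '2020': 0.5}
```

Funneling everything through one key defeats MapReduce's parallelism, which is why this workload is a poor fit for the model, but it is the correct way to express a global total within it.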

Worth pointing out that PySpark or Hive can easily do the same, so you don't "need" to write MapReduce, but this is how those tools work behind the scenes.
