Home > OS >  Hoc to return number of films for each note interval ? (Hadoop MapReduce Python 3)
Hoc to return number of films for each note interval ? (Hadoop MapReduce Python 3)

Time:12-10

I am a novice on Hadoop MapReduce and I am trying to return the number of films for each rating interval from this dataset : "title.ratings.tsv" dataset. I have an issue : I am not able to show the intervals. Here is my code :

`

from mrjob.job import MRJob
from mrjob.step import MRStep

class MovieCountPerRating(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_ratings,
            reducer=self.reducer_count_ratings)
        ]

    # Mapper
    def mapper_get_ratings(self, _, line):
        (tconst, averageRating, numVotes ) = line.split('\t')
        yield averageRating, 1

    # Reducer
    def reducer_count_ratings(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MovieCountPerRating.run()

`

Do you have an idea to how I can return the number of movies for each interval of ratings ([0,1], ]1,2], ]2,3], ]3,4], ]4,5], ]5,6], ]6,7], ]7,8], ]8,9], ]9,10]) ?

I codes a map and a reduce function that work well on Ubuntu but I can't seem to find a way to show the intervals of ratings

CodePudding user response:

In order to show the intervals for the ratings, you can simply use the round function in Python to round each rating to the nearest interval. For example, if a rating is 4.6, it will be rounded to 4.5 and will fall under the interval ]4,5]. Here is an example of how you can modify your code to show the intervals:

from mrjob.job import MRJob from mrjob.step import MRStep

class MovieCountPerRating(MRJob):

def steps(self):
    return [
        MRStep(mapper=self.mapper_get_ratings,
        reducer=self.reducer_count_ratings)
    ]

# Mapper
def mapper_get_ratings(self, _, line):
    (tconst, averageRating, numVotes ) = line.split('\t')
    # Round the rating to the nearest interval
    rating_interval = round(float(averageRating) * 2) / 2
    yield rating_interval, 1

# Reducer
def reducer_count_ratings(self, key, values):
    yield key, sum(values)

if name == 'main': MovieCountPerRating.run()

This code will round each rating to the nearest interval and then emit the interval as the key and 1 as the value. In the reducer, the values for each interval will be summed to give the total number of movies for each interval.

  • Related