How do I chain different reduce steps in Hadoop?
I am having trouble understanding Hadoop. I have two files, and as a first step I joined them. One file lists countries and the other lists the clients in each country.
Example, clients.csv:
Bertram Pearcy ,bueno,SO
Steven Ulman ,regular,ZA
Countries.csv
Name,Code
Afghanistan,AF
Åland Islands,AX
Albania,AL
…
I wrote one map/reduce job that tells me how many "good" (bueno) clients each country code (ZA, SO) has, and with countries.csv I know which country each code refers to.
I programmed:
    def steps(self):
        # Order the operations for execution.
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer),
            MRStep(mapper=self.mapper1,
                   combiner=self.combiner_cuenta_palabras,
                   reducer=self.reducer2),
        ]
The result of my map/reduce is:
["South Georgia and the South Sandwich Islands"] 1
["South Sudan"] 1
["Spain"] 3
Now I would like to know which country has the highest count.
I added one more reduce step:
    def reducer3(self, _, values):
        yield _, max(values)

    def steps(self):
        # Order the operations for execution.
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer),
            MRStep(mapper=self.mapper1,
                   combiner=self.combiner_cuenta_palabras,
                   reducer=self.reducer2),
            MRStep(  # mapper=self.mapper3,
                   reducer=self.reducer3),
        ]
But I get the same answer as without that reducer.
I tried to extend my map/reduce program by adding another reduce step, but that does not work.
With my first reduce I got:
A, 10
C, 2
D, 5
Now I would like to use that result to get: A, 10
Additional comment:
INPUT [File1] [File2] => MAP/REDUCE => OUT
Now I need an additional map/reduce step (reusing what I already did) to get other answers.
First) One and only one answer. Example: 3 Spain
Second) All countries with the highest number. Example: 3 Spain and 3 Guan
I tried to use:
    def reducer3(self, _, values):
        yield _, max(values)
And I added:
    def steps(self):
        # Order the operations for execution.
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer),
            MRStep(mapper=self.mapper1,
                   combiner=self.combiner_cuenta_palabras,
                   reducer=self.reducer2),
            MRStep(reducer=self.reducer3),
        ]
But I still get the same result. I know that reducer3 is being used, because if I write max(values) + 1000 I get the same results but with the numbers 1001, 1003.
CodePudding user response:
Your reducer receives 3 distinct keys, so it finds the max for each key separately, and values only has one element per key (try printing its length...). That is why you get 3 results.
You need a third mapper that returns (None, f'{key}|{value}'), for example. Then all records are sent to one reducer, where you can iterate, parse, and aggregate the results:
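To see why the extra reducer changes nothing, here is a minimal local sketch in plain Python (no Hadoop or mrjob involved). The `run_reduce_step` helper is a hypothetical stand-in for the shuffle phase, written only for illustration: it groups pairs by key and calls the reducer once per key, which is roughly what happens between steps.

```python
from itertools import groupby


def reducer3(key, values):
    # Same logic as the question's reducer3: the max of the values for ONE key.
    yield key, max(values)


def run_reduce_step(pairs, reducer):
    # Hypothetical stand-in for the shuffle: sort by key, group by key,
    # then call the reducer once for each distinct key.
    out = []
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        out.extend(reducer(key, (v for _, v in group)))
    return out


# Output of the second step, as shown in the question.
step2_output = [
    ("South Georgia and the South Sandwich Islands", 1),
    ("South Sudan", 1),
    ("Spain", 3),
]

step3_output = run_reduce_step(step2_output, reducer3)
print(step3_output)
```

Each country is its own key, so each call to `reducer3` sees a single value and the output matches the input.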
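    def reducer3(self, _, values):
        # All records share the key None, so this runs once over every country.
        _max = float('-inf')
        k_out = None
        for x in values:
            # Each value looks like "country|count".
            k, v = x.rsplit('|', 1)
            v = int(v)
            if v > _max:
                _max = v  # store the number, not the string
                k_out = k
        yield k_out, _max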
That returns only one result for all values. If you want to capture ties for the maximum, I think you'll need to iterate over the values more than once (copy them into a list first, since values can only be consumed once), then yield inside a loop over the elements that match the maximum.
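As an alternative to a second pass, ties can also be collected in a single pass by keeping a list of the current winners. The sketch below is plain Python so it can be run locally; the names `mapper3` and `reducer3_with_ties` are illustrative, and the sample data is made up from the examples in the question. In an mrjob job they would be methods taking `self` and would be wired in as `MRStep(mapper=self.mapper3, reducer=self.reducer3_with_ties)`.

```python
def mapper3(key, value):
    # Use one shared key (None) so every record reaches the same reducer call.
    yield None, f"{key}|{value}"


def reducer3_with_ties(_, values):
    # Single pass: track the best count so far and every country that reaches it.
    best = float("-inf")
    winners = []
    for record in values:
        # rsplit guards against a "|" appearing inside a country name.
        country, count = record.rsplit("|", 1)
        count = int(count)
        if count > best:
            best, winners = count, [country]
        elif count == best:
            winners.append(country)
    for country in winners:
        yield country, best


# Local simulation with made-up sample data based on the question.
step2_output = [("South Sudan", 1), ("Spain", 3), ("Guan", 3)]
mapped = [kv for key, value in step2_output for kv in mapper3(key, value)]
result = list(reducer3_with_ties(None, (v for _, v in mapped)))
print(result)
```

To get exactly one answer instead, yield only the first entry of `winners`.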