My problem is sorting words based on their frequency in a file.
My input is in this format:
Word: Frequency:
coffee 6
good 9
I 50
morning 21
happy 9
The expected output should be in this format:
Frequency: Word:
50 I
21 morning
9 good
9 happy
6 coffee
My initial plan was to set the frequency as the key and the word as the value, but I am not sure whether a duplicate key (9) will cause a conflict between its values (good & happy).
public static class Map extends Mapper<Text, Text, Text, Text> {
    // Assumes KeyValueTextInputFormat: the word arrives as the input
    // key and its frequency as the input value.
    @Override
    public void map(Text word, Text frequency, Context context)
            throws IOException, InterruptedException {
        // Swap them: the frequency becomes the output key, the word the value.
        context.write(frequency, word);
    }
}
If the duplicate key does not cause a problem, is it correct to run the input through the above code? I understand that Hadoop automatically sorts the keys, but I am not sure whether that order is ascending or descending. My aim is to achieve descending order.
CodePudding user response:
Duplicate keys don't cause problems in mappers. The words will simply be grouped together in the reducer.
The main problem is that reducers for distinct keys can run in parallel, so you can't guarantee a global ordering of their output. (The shuffle does sort keys, in ascending order by default, but Text keys compare lexicographically rather than numerically, so "9" would sort after "50" anyway.)
My suggestion would be to use Spark, Pig, or Hive rather than MapReduce for simply sorting data (or, for smaller HDFS files, the Unix pipeline hadoop fs -cat file.txt | sort -k2 -rn after running the unsorted WordCount).
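For example, here is a minimal Spark sketch in Java of the same job; the class name, app name, and paths are illustrative, and it assumes one whitespace-separated word/frequency pair per line:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SortByFrequency {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("SortByFrequency"));
        sc.textFile("hdfs:///path/to/wordcounts")        // illustrative input path
          .mapToPair(line -> {
              String[] parts = line.split("\\s+");       // "word frequency"
              return new Tuple2<>(Integer.parseInt(parts[1]), parts[0]);
          })
          .sortByKey(false)                              // false = descending
          .map(t -> t._1() + "\t" + t._2())              // emit "frequency<TAB>word"
          .saveAsTextFile("hdfs:///path/to/sorted");     // illustrative output path
        sc.stop();
    }
}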
To fix the reducer problem in plain MapReduce, you need to force all of the data to a single reducer: use NullWritable as the mapper output key and pack the frequency/word pair into the Text output value. Then, in the reducer, insert the values from the iterable into a sorted data structure such as a TreeMap (built with a descending comparator), and finally loop over that sorted data to write its key/value pairs to the context, as sketched below.
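Here is a minimal, self-contained sketch of that approach. It assumes KeyValueTextInputFormat with the separator configured to match the input (so the word arrives as the mapper's key and the frequency as its value); the FrequencySort wrapper and the other names are illustrative:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FrequencySort {

    // Every record gets the same (null) key, so all pairs reach one reducer.
    public static class SortMapper extends Mapper<Text, Text, NullWritable, Text> {
        private final Text pair = new Text();

        @Override
        public void map(Text word, Text frequency, Context context)
                throws IOException, InterruptedException {
            pair.set(frequency + "\t" + word);
            context.write(NullWritable.get(), pair);
        }
    }

    public static class SortReducer extends Reducer<NullWritable, Text, IntWritable, Text> {
        @Override
        public void reduce(NullWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Parse frequencies as integers (as Text, "9" would sort after "50")
            // and keep them in descending order; a list per frequency preserves
            // ties such as good/happy at 9.
            TreeMap<Integer, List<String>> sorted = new TreeMap<>(Collections.reverseOrder());
            for (Text value : values) {
                String[] parts = value.toString().split("\t");
                sorted.computeIfAbsent(Integer.parseInt(parts[0]), f -> new ArrayList<>())
                      .add(parts[1]);
            }
            for (Map.Entry<Integer, List<String>> entry : sorted.entrySet()) {
                for (String word : entry.getValue()) {
                    context.write(new IntWritable(entry.getKey()), new Text(word));
                }
            }
        }
    }
}

The job driver (omitted here) should also call job.setNumReduceTasks(1); the single NullWritable key already routes every record to the same reduce call, but running one reduce task avoids empty output files from idle reducers.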