Splitting the tokens with Java StringTokenizer


I have a data set that looks like this:

drawdate    lotterynumbers  meganumber  multiplier
2005-01-04  03 06 07 12 32  30            NULL
2005-01-07  02 08 14 15 51  38            NULL
etc.

and the following code:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class LotteryCount {

    /**
     * Mapper which extracts the lottery number and passes it to the Reducer with a single occurrence
     */
    public static class LotteryMapper extends Mapper<Object, Text, IntWritable, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private IntWritable lotteryKey = new IntWritable();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            StringTokenizer itr = new StringTokenizer(value.toString(), ",");
            while (itr.hasMoreTokens()) {
                lotteryKey.set(Integer.valueOf(itr.nextToken()));
                context.write(lotteryKey, one);
            }
        }
    }

    /**
     * Reducer to sum up the occurrence
     */
    public static class LotteryReducer
            extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        IntWritable result = new IntWritable();

        public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;

            for (IntWritable val : values) {
                sum += val.get();
            }

            result.set(sum);
            context.write(key, result);
        }
    }
}

It is essentially the WordCount example from the official Apache Hadoop documentation, just slightly customized for my data set.

I get the following error:

Caused by: java.lang.NumberFormatException: For input string: "2005-01-04"

I am just interested in counting the occurrences of each individual drawn lottery number. How can I do this using the StringTokenizer from my code? I know that I have to split the whole row, because the tokenizer is "fed" the whole line. How can I take the lotterynumbers, split them, and then count?

Thank you in advance

CodePudding user response:

First problem - you'll need to remove the header line of your file before passing it to MapReduce.

Second - you have no commas in your shown dataset, so "," should not be given to the StringTokenizer.

Next - Not all your tokens are Integers, so blindly calling Integer.valueOf(itr.nextToken()) will not work. The first column is a date. You can call itr.nextToken() before the loop to discard the date, but then you need to handle the NULL at the end.

Ultimately, the mapper doesn't need to parse anything. You can also count strings in the reducer.
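
For example, a mapper along these lines follows that advice - skip the header row if it is still there, discard the date, ignore the trailing NULL, and emit every remaining token as a plain string so the reducer only has to count occurrences. This is just a sketch under those assumptions; the class name LotteryTokenMapper is illustrative, and it would sit inside your LotteryCount class next to the existing imports:

public static class LotteryTokenMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text number = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString();
        if (line.startsWith("drawdate")) {
            return; // skip the header row if it hasn't been removed from the file
        }

        // default delimiters (spaces and tabs) - no "," involved
        StringTokenizer itr = new StringTokenizer(line);
        if (itr.hasMoreTokens()) {
            itr.nextToken(); // discard the drawdate column
        }
        while (itr.hasMoreTokens()) {
            String token = itr.nextToken();
            if ("NULL".equals(token)) {
                continue; // ignore the multiplier column
            }
            number.set(token); // keep the token as a string - no parsing
            context.write(number, one);
        }
    }
}

Note that this still counts the meganumber column together with the regular numbers, and the reducer's signature would need to change to Reducer<Text, IntWritable, Text, IntWritable> since the keys are now strings.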

CodePudding user response:

I am just interested in counting the occurrences of each individual drawn lottery number. How can I do this using the StringTokenizer from my code? I know that I have to split the whole row, because the tokenizer is "fed" the whole line. How can I take the lotterynumbers, split them, and then count?

The data sample you posted is tab-delimited:

drawdate    lotterynumbers  meganumber  multiplier
2005-01-04  03 06 07 12 32  30            NULL
2005-01-07  02 08 14 15 51  38            NULL

Here's a simple example, and a few notes:

  • This uses the first line of your sample data as the line variable, including the tab characters separating the data fields (written as \t escapes below), just like you posted.
  • It uses a StringTokenizer with the token separator defined as a single tab character (\t).
  • The program calls hasMoreTokens() until all tokens are seen, printing each one along the way.
  • The output includes left and right brackets to show the boundary of each token. For example, the "30" has a trailing space character that wouldn't be noticeable without the [] characters, same with the leading whitespace in front of "NULL".

String line = "2005-01-04 \t03 06 07 12 32 \t30 \t          NULL";
StringTokenizer tokenizer = new StringTokenizer(line, "\t");

while (tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken();
    System.out.println("token: [" + token + "]");
}

Here's the output:

token: [2005-01-04 ]
token: [03 06 07 12 32 ]
token: [30 ]
token: [          NULL]

You could take this approach, processing all lines, tokenizing on the tab character, and using the 2nd token as your "lotterynumbers" data to do whatever you like.
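
For example (again just a sketch, continuing from the line variable above), the 2nd token could be tokenized a second time with the default whitespace delimiters to reach the individual numbers:

StringTokenizer fields = new StringTokenizer(line, "\t");
fields.nextToken();                         // 1st token: drawdate, discarded
String lotteryNumbers = fields.nextToken(); // 2nd token: "03 06 07 12 32 "
StringTokenizer numbers = new StringTokenizer(lotteryNumbers);
while (numbers.hasMoreTokens()) {
    System.out.println("number: [" + numbers.nextToken() + "]");
}

In your mapper, each of those inner tokens would then be written out with a count of 1, just like in your current loop.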
