How do I count word occurrences in a CSV file?

I have a CSV file that I need to read, and I want to display the number of occurrences of each word. The application should only count words that have more than one letter and are purely alphabetic (not alphanumeric), converted to lowercase.

This is what I have right now. I'm stuck and have no idea where to go from here.


        CSVReader reader = new CSVReaderBuilder(new FileReader(pathFile1)).withSkipLines(1).build();

        Map<String, Integer> frequency = new HashMap<>();
        String[] line;


        while ((line = reader.readNext()) != null) {
            String words = line[1];

            words = words.replaceAll("\\p{Punct}", " ").trim();
            words = words.replaceAll("\\s{2}", " ");
            words = words.toLowerCase();

            if (frequency.containsKey(words)) {
                frequency.put(words, frequency.get(words) + 1);
            } else {
                frequency.put(words, 0);
            }


        }


    }

I am trying to read the second element of the array for each CSV row, which is line[1]. This is where the text of the document is located.

I have replaced all punctuation with spaces and trimmed the result. I have also replaced double spaces with a single space and converted everything to lowercase.

The output I am trying to achieve is:

Title of document: XXXX

Word: example, Value: 5

CodePudding user response：

Your solution does not look that bad. For the initialization, however, I would replace

frequency.put(words, 0);

with

frequency.put(words, 1);

Since I am missing your input file, I created a dummy list that works fine.

    Map<String, Integer> frequency = new HashMap<>();
    List<String> csvSimulation = new ArrayList<String>();
    csvSimulation.add("test");
    csvSimulation.add( "marvin");
    csvSimulation.add("aaaaa");
    csvSimulation.add("nothing");
    csvSimulation.add("test");
    csvSimulation.add("test");
    csvSimulation.add("aaaaa");
    csvSimulation.add("stackoverflow");
    csvSimulation.add("test");
    csvSimulation.add("bread");

    Iterator<String> iterator = csvSimulation.iterator();


    while(iterator.hasNext()){
        String words = iterator.next();
        words = words.toLowerCase();
        if (frequency.containsKey(words)) {
            frequency.put(words, frequency.get(words) + 1);
        } else {
            frequency.put(words, 1);
        }

    }

    System.out.println(frequency);
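
As a side note, the containsKey/put branch above can be collapsed into a single call to Map.merge, which avoids the wrong-initial-value mistake altogether. This is only a minimal sketch: the class name MergeCount and the helper count are made up for illustration, and the list stands in for the real CSV data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not part of the original code
class MergeCount {
    // Counts how often each entry occurs, ignoring case.
    static Map<String, Integer> count(List<String> entries) {
        Map<String, Integer> frequency = new HashMap<>();
        for (String entry : entries) {
            // merge() stores 1 for a new key, otherwise adds 1 to the stored count
            frequency.merge(entry.toLowerCase(), 1, Integer::sum);
        }
        return frequency;
    }

    public static void main(String[] args) {
        List<String> csvSimulation = Arrays.asList("test", "marvin", "aaaaa", "test");
        System.out.println(count(csvSimulation));
    }
}
```

Because a new key always starts at 1 here, there is no separate else branch to get wrong.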

Are you sure that accessing line[1] on each iteration of the while loop is correct? Reading the input correctly seems to be the actual problem to me. Without seeing your CSV file, however, I can't help you any further.

CodePudding user response：

You can use a regex match to verify that words fits your criteria before adding it to your HashMap, like so:

if (words.matches("[a-z]{2,}"))

[a-z] specifies only lowercase alpha chars
{2,} specifies "Minimum of 2 occurrences, maximum of <undefined>"
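
To see what the pattern accepts, here is a small sketch that runs it against a few sample strings. The class name WordFilterDemo, the helper isCountable and the sample values are hypothetical. Note that String.matches tests the whole string, so no ^ or $ anchors are needed.

```java
// Hypothetical demo class showing which strings the pattern accepts
class WordFilterDemo {
    // True only for strings made up of two or more lowercase ASCII letters.
    static boolean isCountable(String word) {
        return word.matches("[a-z]{2,}");
    }

    public static void main(String[] args) {
        String[] candidates = {"example", "a", "abc123", "Test", "two words"};
        for (String candidate : candidates) {
            // Only "example" passes; the others are too short, contain digits,
            // contain an uppercase letter, or contain a space
            System.out.println(candidate + " -> " + isCountable(candidate));
        }
    }
}
```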

Though, given you're converting punctuation to spaces, it sounds like you could have multiple words in line[1]. If you want to gather counts of multiple words across multiple lines, then you may want to split words on the space char, like so:

for (String word : words.split(" ")) {
  if (word.matches("[a-z]{2,}")) {
    // Then use your code for checking if frequency contains the term,
    //   but use `word` instead of `words`
  }
}
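
Putting the pieces together, a rough sketch of the whole counting step could look like this. It assumes the text of each row has already been read from line[1]; the two sample strings, the class name WordFrequencySketch and the helper countWords are made up for illustration. It splits on "\\s+" rather than a single space so that runs of spaces do not need to be collapsed first, and it uses Map.merge in place of the containsKey check.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; the CSV reading itself is left out
class WordFrequencySketch {
    // Adds the countable words of one text cell to the running counts.
    static void countWords(String text, Map<String, Integer> frequency) {
        String cleaned = text.replaceAll("\\p{Punct}", " ").toLowerCase().trim();
        // Split on runs of whitespace so repeated spaces yield no empty tokens
        for (String word : cleaned.split("\\s+")) {
            // Keep only words of two or more lowercase letters
            if (word.matches("[a-z]{2,}")) {
                frequency.merge(word, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the line[1] values read from the CSV (assumed data)
        String[] cells = {"An example, a SMALL example!", "Example 42: small text."};
        // TreeMap is used only so the output is sorted alphabetically
        Map<String, Integer> frequency = new TreeMap<>();
        for (String cell : cells) {
            countWords(cell, frequency);
        }
        frequency.forEach((word, count) ->
                System.out.println("Word: " + word + ", Value: " + count));
    }
}
```

With the two sample strings, "a" and "42" are filtered out and "example" is counted three times across both rows.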