How can I properly count the number of sentences from the file in Java using the split method?

Time:10-26

How can I accurately count the number of sentences in a file?

My file contains some text with 7 sentences, but my code reports 9.

    String path = "C:/CT_AQA - Copy/src/main/resources/file.txt";
    BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path)));
    String line;
    int countWord = 0;
    int sentenceCount = 0;
    int characterCount = 0;
    int paragraphCount = 0;
    int countNotLetter = 0;
    int letterCount = 0;
    int wordInParagraph = 0;
    List<Integer> wordsPerParagraph = new ArrayList<>();

    while ((line = br.readLine()) != null) {
        if (line.equals("")) {
            paragraphCount++;
            wordsPerParagraph.add(wordInParagraph);
            System.out.printf("In %d paragraph there are %d words\n", paragraphCount, wordInParagraph);
            wordInParagraph = 0;
        } else {

            characterCount += line.length();
            String[] wordList = line.split("[\\s—]");
            countWord += wordList.length;
            wordInParagraph += wordList.length;

            String[] letterList = line.split("[^a-zA-Z]");
            countNotLetter += letterList.length;
            String[] sentenceList = line.split("[.:]");
            sentenceCount += sentenceList.length;
        }
        letterCount = characterCount - countNotLetter;
    }

    if (wordInParagraph != 0) {
        wordsPerParagraph.add(wordInParagraph);
    }

    br.close();
    System.out.println("The amount of words are " + countWord);
    System.out.println("The amount of sentences are " + sentenceCount);
    System.out.println("The amount of paragraphs are " + paragraphCount);
    System.out.println("The amount of letters are " + letterCount);
   

CodePudding user response:

Your code looks like it works, although it doesn't follow best practices everywhere.

I suspect the wrong answer comes from an inaccurate regex for detecting the end of a sentence. Your code treats every dot and every colon as a sentence boundary. The problem is in this line:

String[] sentenceList = line.split("[.:]");

But a colon does not end a sentence, and sentences can also end with other characters (exclamation marks, question marks, ellipses). In my assessment this pattern is more accurate, because it matches one or more terminators only when they are followed by whitespace or the end of the line:

"[!?.]+(?=$|\\s)"

Please also post the contents of the file that gives you the wrong result, so that this assumption can be checked.
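
To illustrate the difference between the two patterns, here is a minimal sketch. The class name `SentencePatternDemo`, the helper `countPieces`, and the sample line are all made up for this example; they are not taken from the asker's file.

```java
// Hypothetical demo class; the sample line is invented for illustration.
class SentencePatternDemo {

    // Returns how many pieces String.split produces for the given regex.
    // Note that split drops trailing empty strings but keeps empty strings
    // in the middle of the result.
    static int countPieces(String line, String regex) {
        return line.split(regex).length;
    }

    public static void main(String[] args) {
        // Four sentences: one contains a colon, one ends in an ellipsis,
        // and one contains a decimal number.
        String line = "Note: it works. Really? Yes... Version 1.5 is out!";

        // The colon, each dot of the ellipsis, and the dot in "1.5"
        // are all treated as separators.
        System.out.println(countPieces(line, "[.:]"));            // prints 7

        // A run of terminators counts once, and only when it is followed
        // by whitespace or the end of the line.
        System.out.println(countPieces(line, "[!?.]+(?=$|\\s)")); // prints 4
    }
}
```

The second pattern is still a heuristic: an abbreviation such as "e.g." followed by a space would be counted as a sentence end.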

Full code counting only the number of sentences in your file:

int sentenceCount = 0;

while ((line = br.readLine()) != null) {
  if (!"".equals(line)) {
    String[] sentencesArray = line.split("[!?.]+(?=$|\\s)");
    sentenceCount += sentencesArray.length;
  }
}

br.close();
System.out.println("The amount of sentences are " + sentenceCount);
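
The answer does not say which best practices it has in mind. As one possible illustration, the same counting logic can be wrapped in a method and the reader closed with try-with-resources. This is only a sketch: `SentenceCounter` and `countSentences` are hypothetical names, the input is an in-memory string rather than the asker's file, and trimming each line before splitting is an added assumption so that whitespace-only lines are skipped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper class, not part of the original question or answer.
class SentenceCounter {

    // Sums the per-line sentence counts using the pattern suggested above.
    // BufferedReader.lines() is used so no checked exception is thrown here.
    static int countSentences(BufferedReader reader) {
        return reader.lines()
                .map(String::trim)                 // drop leading/trailing whitespace
                .filter(line -> !line.isEmpty())   // skip blank lines
                .mapToInt(line -> line.split("[!?.]+(?=$|\\s)").length)
                .sum();
    }

    public static void main(String[] args) throws IOException {
        // Sample text invented for this example: two paragraphs, four sentences.
        String text = "First one. Second one!\n\nThird one? Fourth...\n";

        // try-with-resources closes the reader even if an exception occurs.
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            System.out.println(countSentences(reader)); // prints 4
        }
    }
}
```

Because each line is split on its own, a sentence that wraps across two lines of the file is counted twice, and a line with no terminator at all still counts as one sentence. Either effect could explain an overcount, but that cannot be confirmed without seeing the file.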

CodePudding user response:

You are probably picking up trailing whitespace after your sentences, which adds extra elements to your array. You can remove the whitespace from the line before you split it into sentences by calling replaceAll("\\s+", "").

The changed code would look like this:

String[] sentenceList = line.replaceAll("\\s+","").split("[.:]");

I did not change how you define a sentence; however, ! and ? could obviously be sentence delimiters as well.
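
The effect of trailing whitespace can be reproduced with a short sketch. The class `TrailingWhitespaceDemo`, its method names, and the sample line are invented for this example; the `trim()` variant is an alternative added here and is not part of the answer above.

```java
// Hypothetical demo class; the sample line is invented for illustration.
class TrailingWhitespaceDemo {

    // Splits the line as-is, as in the question.
    static int rawCount(String line) {
        return line.split("[.:]").length;
    }

    // Removes all whitespace first, as suggested in this answer.
    static int strippedCount(String line) {
        return line.replaceAll("\\s+", "").split("[.:]").length;
    }

    // Removes only leading and trailing whitespace (an assumed alternative).
    static int trimmedCount(String line) {
        return line.trim().split("[.:]").length;
    }

    public static void main(String[] args) {
        String line = "One. Two. "; // note the trailing space

        // Pieces are "One", " Two" and " "; the last one is not empty,
        // so split keeps it.
        System.out.println(rawCount(line));      // prints 3

        // "One.Two." splits into "One" and "Two"; the trailing empty
        // string is dropped by split.
        System.out.println(strippedCount(line)); // prints 2

        // "One. Two." splits into "One" and " Two".
        System.out.println(trimmedCount(line));  // prints 2
    }
}
```

Removing all whitespace is fine when only the count matters, but it should not be combined with a pattern that relies on whitespace after the terminator, such as "[!?.]+(?=$|\\s)", because "One.Two." would then be counted as a single sentence.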
