How to autocorrect misplaced hyphen in a word?-CodePudding

Situation & Problem

1 .

eg:

Say, you have a paragraph.

The word sentence is broken down to sente-nce with a hyphen.

Imagine you have this sample sentence, which is a very long sente-
nce that has a word being broken down with a hyphen.

2 .

How can I detect that word sente-nce is broken down with a hyphen, and correct it into sentence?

note:

Is there any library I can use to do that (prefer Java / Python / any software)?
Using a simple regex to match all (\w)-(\w) & replace with $1$2, wont work in all cases.

eg: Imagine you have a word event-driven, it will become eventdriven, which is undesired.

CodePudding user response：

You need to check if the word belongs to english vocabulary. Find all the matches, for each check if word exists in english vocabulary and if not, then change the word. Something like:

import enchant
voc = enchant.Dict("en_US")

word = "sente-nce"

voc.check(word)

It returns False if it's not a word.

CodePudding user response：

Solution _{^{(may not be the best)}}

usage

@to_use::

put your dictionary in Path path = Paths.get("words_alpha.txt"); <= https://github.com/dwyl/english-words
put your sentence to autoCorrect on in content_TESTING
execute & get output

@note::

depending on the quality of the dictionary, the results may not be good.

@note::

if your words contains "space or newline \n" -> modify the regex in String str_RegexPattern = "([a-zA-Z] )-([a-zA-Z] )";

@note::

this is not fully tested yet

code

package com.ex.main.autoCorrectHypen;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/*
@to_use:: 
1. put your dictionary in `Path path = Paths.get("words_alpha.txt");` <= https://github.com/dwyl/english-words
2. put your sentence to autoCorrect on in `content_TESTING`
3. execute & get output 

@note::
depending on the quality of the dictionary, the results may not be good. 
@note::
if your words contains "space or newline \n" -> modify the regex in `String str_RegexPattern = "([a-zA-Z] )-([a-zA-Z] )";`
@note::
this is not fully tested yet
*/

// https://stackoverflow.com/questions/11607270/how-to-check-whether-given-string-is-a-word
// https://github.com/dwyl/english-words
// ~// https://github.com/first20hours/google-10000-english
class Dictionary {
  private static HashSet<String> wordsSet = new HashSet<>();

  public static void initDictionary() throws IOException {
    Path path = Paths.get("words_alpha.txt");
    byte[] readBytes = Files.readAllBytes(path);
    String wordListContents = new String(readBytes, "UTF-8");
    String[] words = wordListContents.split("\r\n"); // @atten: \r\n or \n
    Collections.addAll(wordsSet, words);
  }

  static {
    try {
      initDictionary();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

  public static boolean contains(String word) { return wordsSet.contains(word); }
}

public class AutoCorrectHypen {

  public static String autoCorrectHypen(String content_ValidateOn) {
    String content_SearchOn = content_ValidateOn;
    String str_RegexPattern = "([a-zA-Z] )-([a-zA-Z] )";
    Pattern pattern = Pattern.compile(str_RegexPattern);
    Matcher matcher = pattern.matcher(content_SearchOn);

    StringBuilder sb_ContentSearchOn = new StringBuilder(content_SearchOn);
    StringBuilder content_Replaced = new StringBuilder();
    int ind_MatchGroupEnd_prev = 0;
    int ind_MatchGroupEnd_curr;
    int ind_MatchGroupStart_curr;
    while (matcher.find()) {
      // 
      ind_MatchGroupStart_curr = matcher.start(0);
      ind_MatchGroupEnd_curr = matcher.end(0);
      String content_BeforeMatchGroup = sb_ContentSearchOn.substring(ind_MatchGroupEnd_prev, ind_MatchGroupStart_curr); // prev end to curr start, not start to end
      content_Replaced.append(content_BeforeMatchGroup);

      // 
      String content_SearchOn_innerMatch_G0 = matcher.group(0);
      String content_SearchOn_innerMatch_G1 = matcher.group(1);
      String content_SearchOn_innerMatch_G2 = matcher.group(2);
      String content_Replaced_innerMatch = autoCorrectHypen_innerMatch(content_SearchOn_innerMatch_G0, content_SearchOn_innerMatch_G1, content_SearchOn_innerMatch_G2);
      content_Replaced.append(content_Replaced_innerMatch);

      // 
      ind_MatchGroupEnd_prev = ind_MatchGroupEnd_curr;
    }
    System.out.println("-------");

    // append the content after the last match group
    String content_AfterLastMatchGroup = sb_ContentSearchOn.substring(ind_MatchGroupEnd_prev, sb_ContentSearchOn.length());
    content_Replaced.append(content_AfterLastMatchGroup);

    return content_Replaced.toString();
  }

  protected static String autoCorrectHypen_innerMatch(String content_SearchOn_innerMatch_G0, String content_SearchOn_innerMatch_G1, String content_SearchOn_innerMatch_G2) {
    System.out.printf("> %s; %s; %s; %n", content_SearchOn_innerMatch_G0, content_SearchOn_innerMatch_G1, content_SearchOn_innerMatch_G2);
    String content_Replaced_innerMatch = null;
    // @atten: order of the if stmt matters
    if (Dictionary.contains(content_SearchOn_innerMatch_G0)) {
      content_Replaced_innerMatch = content_SearchOn_innerMatch_G0;
      System.out.printf(">> %s: %n%s %n", "whole word - with hypen, G0", content_Replaced_innerMatch);
    } else if (Dictionary.contains(content_SearchOn_innerMatch_G1   content_SearchOn_innerMatch_G2)) {
      content_Replaced_innerMatch = content_SearchOn_innerMatch_G1   content_SearchOn_innerMatch_G2;
      System.out.printf(">> %s: %n%s %n", "whole word - remove hypen, G1   G2", content_Replaced_innerMatch);
    } else if (Dictionary.contains(content_SearchOn_innerMatch_G1) && Dictionary.contains(content_SearchOn_innerMatch_G2)) {
      content_Replaced_innerMatch = content_SearchOn_innerMatch_G0;
      System.out.printf(">> %s: %n%s %n", "whole word - with hypen, G1 && G2", content_Replaced_innerMatch);
    } else {
      content_Replaced_innerMatch = content_SearchOn_innerMatch_G0;
      System.err.println(">> No such word");
    }
    return content_Replaced_innerMatch;
  }

  //################################################################################################

  static final String content_TESTING_Simple = ""
                                                 "Check the word sente-nce, event-driven, family-owned, chocolate-covered, anti-clockwise.\n"
                                                 "samp-le, diff-erence, what-do-you-mean, how-ever, be-cause, other-wise, pill-ow";

  static final String content_TESTING = ""
                                          "Imagine you have this sample sentence, which is a very long sente-\n"
                                          "nce that has a word being broken down with a hyphen. \n"
                                          "\n"
                                          "Check the word sente-nce, event-driven, family-owned, chocolate-covered, anti-clockwise.\n"
                                          "";

  public static void main(String[] args) throws Exception {
    System.out.println(autoCorrectHypen(content_TESTING_Simple)); // 
  }

}

input

Check the word sente-nce, event-driven, family-owned, chocolate-covered, anti-clockwise.
samp-le, diff-erence, what-do-you-mean, how-ever, be-cause, other-wise, pill-ow

output

Check the word sentence, event-driven, family-owned, chocolate-covered, anticlockwise.
sample, difference, what-do-you-mean, however, because, otherwise, pillow

Situation & Problem

Solution (may not be the best)

usage

code

Solution _{^{(may not be the best)}}