How to remove exaggerated characters from a word?-CodePudding

I'm searching for an algorithm to remove exaggerated characters from a word.

I mean.
A sentence like this:
This is a verrrrryyyy goooood experience!
there are extra characters.
I want to preprocess and remove extra characters to This is a very good experience!
I know that is possible to do something like writing regex or everything to remove them.
But for words like good, buzzer, bazaar that have repeated characters, is there any other way to recognize which repeated word is not exaggerated?
Should I make a dictionary for them?

CodePudding user response：

As SimonN said there is no deterministic solution. But if you want to have some kind of matching (which is not perfect but gives decent results), you can use pyEnchant.

The following steps are performed:

check if the word is in the english dictionary via check()
remove all duplicated letters and perform another dictionary check
get suggestions via the suggest() function and use the first suggestion which contains the same letters as the original word

Here the code for this logic:

import enchant

def removeDuplicates(word):
    result = ""
    prevChar = ""
    for char in word:
        if prevChar == char:
            pass
        else:
            result  = char
        prevChar = char
    return result

def getWordsWithSameLetters(original, suggestions):
    oriLetters = "".join(dict.fromkeys(original))
    sameLetterWords = []
    for sug in suggestions:
        sugLetters = "".join(dict.fromkeys(sug))
        if oriLetters == sugLetters:
            sameLetterWords.append(sug)

    return sameLetterWords


d = enchant.Dict("en_US")

sentence = "This is a verrrrryyyy goooood experience froom schoooool"

for word in sentence.split():
    if d.check(word):
        print(word)
    elif d.check(removeDuplicates(word)):
        print(removeDuplicates(word))
    else:
        suggestions = d.suggest(removeDuplicates(word))
        matchingSug = getWordsWithSameLetters(word, suggestions)
        print(matchingSug[0])

The example input

This is a verrrrryyyy goooood experience froom schoooool

gives the following output (which also shows the mentioned issue in god/good):

This is a very god experience from school

CodePudding user response：

Given a word word, you can remove all adjacent duplicates using ''.join(c for c,_ in groupby(word)).
Using a words list such as the scrabble dictionary or the Unix words file /usr/share/dict/words, build a dict that maps a word-with-no-adjacent-duplicate to the list of all existing words that would result in that pattern when submitted to ''.join(c for c,_ in groupby(word)).
For every word in your sentence, use that dict to find candidates for replacement.
Select the longest candidate, so as to remove as few letters as possible.

from itertools import groupby

scrabble_dict_filename = 'scrabble_dict.txt'

def build_dict():
    d = {}
    with open(scrabble_dict_filename, 'r') as f:
        next(f)
        for line in f:
            word = line.strip().lower()
            if len(word) > 0:
                key = ''.join(c for c,_ in groupby(word))
                d.setdefault(key, []).append(word)
    return d

def simplify_word(word, d):
    mask = [(c, sum(1 for _ in g)) for c,g in groupby(word)]
    key = ''.join(c for c,_ in mask)
    candidates = d.get(key, [])
    masks = [[(c, sum(1 for _ in g)) for c,g in groupby(w)] for w in candidates]
    candidates = [w for w,m in zip(candidates, masks) if len(m) == len(mask) and all(cw == cword and kw <= kword for (cw,kw),(cword,kword) in zip(m, mask))]
    return max(candidates, key=len) if candidates else word

def simplify_sentence(s, d):
    return ' '.join(simplify_word(w, d) for w in s.split())

def main(argv):
    d = build_dict()
    s = 'This is a verrrrryyyy goooood experience froom schoooool wwwiiiittttthh bbuuzzeerr bbaazzaarr'
    if len(argv) > 1:
        s = ' '.join(argv[1:])
    print(simplify_sentence(s,d))

if __name__ == '__main__':
    import sys
    main(sys.argv)

Output:

This is a verry good experience from school with buzzer bazaar

Statistics regarding ambiguous words:

There are 273199 words in the dictionary.
There are 5721 classes of exactly 2 words such as verry/very or bazar/bazaar;
There are 258 classes of exactly 3 words such as ['steeming', 'steming', 'stemming']; those are mostly conjugated verbs.
There are only 15 classes of 4 words: [['aargh', 'aarrgh', 'aarrghh', 'argh'], ['colly', 'coly', 'coolly', 'cooly'], ['done', 'donee', 'donne', 'donnee'], ['garote', 'garotte', 'garrote', 'garrotte'], ['garoted', 'garotted', 'garroted', 'garrotted'], ['garotes', 'garottes', 'garrotes', 'garrottes'], ['garoting', 'garotting', 'garroting', 'garrotting'], ['karoos', 'karos', 'kaross', 'karroos'], ['leeses', 'leses', 'lessees', 'lesses'], ['millionaires', 'millionairess', 'millionnaires', 'millionnairess'], ['reefed', 'refed', 'refeed', 'reffed'], ['tittuped', 'tittupped', 'tituped', 'titupped'], ['tittuping', 'tittupping', 'tituping', 'titupping'], ['trameled', 'tramelled', 'trammeled', 'trammelled'], ['trameling', 'tramelling', 'trammeling', 'trammelling']]
There are no classes of 5 or more words.