I'm searching for an algorithm to remove exaggerated characters from a word.
I mean.
A sentence like this: This is a verrrrryyyy goooood experience!
there are extra characters.
I want to preprocess and remove extra characters to This is a very good experience!
I know that is possible to do something like writing regex or everything to remove them.
But for words like good, buzzer, bazaar
that have repeated characters, is there any other way to recognize which repeated word is not exaggerated?
Should I make a dictionary for them?
CodePudding user response:
As SimonN said there is no deterministic solution. But if you want to have some kind of matching (which is not perfect but gives decent results), you can use pyEnchant.
The following steps are performed:
- check if the word is in the english dictionary via
check()
- remove all duplicated letters and perform another dictionary check
- get suggestions via the
suggest()
function and use the first suggestion which contains the same letters as the original word
Here the code for this logic:
import enchant
def removeDuplicates(word):
result = ""
prevChar = ""
for char in word:
if prevChar == char:
pass
else:
result = char
prevChar = char
return result
def getWordsWithSameLetters(original, suggestions):
oriLetters = "".join(dict.fromkeys(original))
sameLetterWords = []
for sug in suggestions:
sugLetters = "".join(dict.fromkeys(sug))
if oriLetters == sugLetters:
sameLetterWords.append(sug)
return sameLetterWords
d = enchant.Dict("en_US")
sentence = "This is a verrrrryyyy goooood experience froom schoooool"
for word in sentence.split():
if d.check(word):
print(word)
elif d.check(removeDuplicates(word)):
print(removeDuplicates(word))
else:
suggestions = d.suggest(removeDuplicates(word))
matchingSug = getWordsWithSameLetters(word, suggestions)
print(matchingSug[0])
The example input
This is a verrrrryyyy goooood experience froom schoooool
gives the following output (which also shows the mentioned issue in god/good):
This is a very god experience from school
CodePudding user response:
- Given a word
word
, you can remove all adjacent duplicates using''.join(c for c,_ in groupby(word))
. - Using a words list such as the scrabble dictionary or the Unix
words
file/usr/share/dict/words
, build adict
that maps a word-with-no-adjacent-duplicate to the list of all existing words that would result in that pattern when submitted to''.join(c for c,_ in groupby(word))
. - For every word in your sentence, use that dict to find candidates for replacement.
- Select the longest candidate, so as to remove as few letters as possible.
from itertools import groupby
scrabble_dict_filename = 'scrabble_dict.txt'
def build_dict():
d = {}
with open(scrabble_dict_filename, 'r') as f:
next(f)
for line in f:
word = line.strip().lower()
if len(word) > 0:
key = ''.join(c for c,_ in groupby(word))
d.setdefault(key, []).append(word)
return d
def simplify_word(word, d):
mask = [(c, sum(1 for _ in g)) for c,g in groupby(word)]
key = ''.join(c for c,_ in mask)
candidates = d.get(key, [])
masks = [[(c, sum(1 for _ in g)) for c,g in groupby(w)] for w in candidates]
candidates = [w for w,m in zip(candidates, masks) if len(m) == len(mask) and all(cw == cword and kw <= kword for (cw,kw),(cword,kword) in zip(m, mask))]
return max(candidates, key=len) if candidates else word
def simplify_sentence(s, d):
return ' '.join(simplify_word(w, d) for w in s.split())
def main(argv):
d = build_dict()
s = 'This is a verrrrryyyy goooood experience froom schoooool wwwiiiittttthh bbuuzzeerr bbaazzaarr'
if len(argv) > 1:
s = ' '.join(argv[1:])
print(simplify_sentence(s,d))
if __name__ == '__main__':
import sys
main(sys.argv)
Output:
This is a verry good experience from school with buzzer bazaar
Statistics regarding ambiguous words:
- There are 273199 words in the dictionary.
- There are 5721 classes of exactly 2 words such as
verry/very
orbazar/bazaar
; - There are 258 classes of exactly 3 words such as
['steeming', 'steming', 'stemming']
; those are mostly conjugated verbs. - There are only 15 classes of 4 words:
[['aargh', 'aarrgh', 'aarrghh', 'argh'], ['colly', 'coly', 'coolly', 'cooly'], ['done', 'donee', 'donne', 'donnee'], ['garote', 'garotte', 'garrote', 'garrotte'], ['garoted', 'garotted', 'garroted', 'garrotted'], ['garotes', 'garottes', 'garrotes', 'garrottes'], ['garoting', 'garotting', 'garroting', 'garrotting'], ['karoos', 'karos', 'kaross', 'karroos'], ['leeses', 'leses', 'lessees', 'lesses'], ['millionaires', 'millionairess', 'millionnaires', 'millionnairess'], ['reefed', 'refed', 'refeed', 'reffed'], ['tittuped', 'tittupped', 'tituped', 'titupped'], ['tittuping', 'tittupping', 'tituping', 'titupping'], ['trameled', 'tramelled', 'trammeled', 'trammelled'], ['trameling', 'tramelling', 'trammeling', 'trammelling']]
- There are no classes of 5 or more words.