I want to remove repeating characters from a sentence but make it so that the words still retain its meaning (if it has any). For example : I'm so haaappppyyyy about offline school
to I'm so happy about offline school
. See, haaappppyyyy
became happy
and offline & school
stay the same instead becoming ofline & schol
I've tried two solutions, using RE and itertools, but none really fits for what I'm searching for
Using Regex :
tweet = 'I'm so haaappppyyyy about offline school'
repeat_char = re.compile(r"(.)\1{1,}", re.IGNORECASE)
tweet = repeat_char.sub(r"\1\1", tweet)
tweet = re.sub("(.)\\1{2,}", "\\1", tweet)
output :
I'm so haappyy about offline school #it makes 2 chars for every repating chars
using itertools :
tweet = 'I'm so happy about offline school'
tweet = ''.join(ch for ch, _ in itertools.groupby(tweet))
output :
I'm so hapy about ofline schol
How can I fix this? should I make a lists of words I want to exclude?
In addition, I want it to also be able to reduce some words that's in a pattern to it's base form. For example :
wkwk (base form)
wkwkwkwk
wkwkwkwkwkwkwk
I want to make the second and the third word into the first word, the base form
CodePudding user response:
You can combine regex and NLP here by iterating over all words in a string, and once you find one with identical consecutive letters reduce them to max 2 consecutive occurrences of the same letters and run the automatic spellcheck to fix the spelling.
See an example Python code:
import re
from textblob import TextBlob
from textblob import Word
rx = re.compile(r'([^\W\d_])\1{2,}')
print( re.sub(r'[^\W\d_] ', lambda x: Word(rx.sub(r'\1\1', x.group())).correct() if rx.search(x.group()) else x.group(), tweet) )
# => "I'm so happy about offline school"
The code uses the Textblob
library, but you may use any you like.
Note that ([^\W\d_])\1{2,}
matches any three or more consecutive letters, [^\W\d_]
matches one or more letters.
CodePudding user response:
Well, first of all you need a list (or set) of all allowed words, to compare with.
I'd approach it with the assumption (which might be wrong) that no words contain sequences of more than two repeating characters. So for each word generate a list of all potential candidates, for example "haaappppppyyyy" would yield you ["haappyy", "happyy", "happy", etc]. then it's just a matter of checking which one of those words actually exists by comparing to the allowed word list. The time complexity of this is quite high, tho so if it needs to go fast then throw a hash table on it or something :)