I have a web story that has censored words in it, marked with asterisks.
Right now I'm restoring them with a simple and dumb str.replace,
but as you can imagine this is a pain, and I need to search the text to find every instance of the censoring.
Here are the instances of "bastard" that are capitalized, plural, and with asterisks in different places:
toReplace = toReplace.replace("b*stard", "bastard")
toReplace = toReplace.replace("b*stards", "bastards")
toReplace = toReplace.replace("B*stard", "Bastard")
toReplace = toReplace.replace("B*stards", "Bastards")
toReplace = toReplace.replace("b*st*rd", "bastard")
toReplace = toReplace.replace("b*st*rds", "bastards")
toReplace = toReplace.replace("B*st*rd", "Bastard")
toReplace = toReplace.replace("B*st*rds", "Bastards")
Is there a way to compare every word containing "*" (or any other replacement character) against an already compiled dict and replace it with the uncensored version of the word? Maybe regex, but I don't think so.
CodePudding user response:
Using regex alone will likely not result in a full solution for this. You would likely have an easier time if you keep a simple list of the words that you want to restore, and use Levenshtein distance to determine which one is closest to a given word that you have found a * in.
One library that may help with this is fuzzywuzzy.
The two approaches that I can think of quickly:
- Split the text so that you have 1 string per word. For each word, if '*' in word, then compare it to the list of replacements to find which is closest.
- Use re.sub to identify the words that contain a * character, and write a function that you would use as the repl argument to determine which replacement it is closest to and return that replacement.
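The second approach could be sketched roughly as follows. This is only an illustration: it swaps in the standard library's difflib.get_close_matches (a similarity ratio, not true Levenshtein distance) in place of fuzzywuzzy, and WORDLIST, CENSORED, restore and uncensor are hypothetical names. It also assumes each * hides exactly one letter, so only same-length candidates are considered.

```python
import difflib
import re

# Hypothetical word list; in practice this would be your own dictionary.
WORDLIST = ["bastard", "bastards", "apple", "orange"]

# A censored word: a run of letters and asterisks containing at least one '*'.
CENSORED = re.compile(r"[A-Za-z]*\*[A-Za-z*]*")


def restore(match):
    """Callback for re.sub: return the closest word, or the input unchanged."""
    censored = match.group(0)
    # Assumption: each '*' replaces exactly one letter, so lengths must agree.
    candidates = [w for w in WORDLIST if len(w) == len(censored)]
    close = difflib.get_close_matches(censored.lower(), candidates, n=1, cutoff=0.6)
    if not close:
        return censored  # nothing similar enough; leave the text alone
    word = close[0]
    # Carry over the capitalization of the first letter.
    return word.capitalize() if censored[0].isupper() else word


def uncensor(text):
    return CENSORED.sub(restore, text)


print(uncensor("The B*st*rds and a b*stard."))
```

The cutoff of 0.6 is difflib's default and would probably need tuning against a real word list; a heavily censored word may fall below it and be left as-is.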
Additional resources:
- Python: find closest string (from a list) to another string
- Find closest string match from list
- How to find closest match of a string from a list of different length strings python?
CodePudding user response:
You can use the re module to find matches between the censored word and the words in your wordlist. Replace * with . (the dot has a special meaning in regex: it matches any single character) and then use re.match:
import re
wordlist = ["bastard", "apple", "orange"]
def find_matches(censored_word, wordlist):
    pat = re.compile(censored_word.replace("*", "."))
    return [w for w in wordlist if pat.match(w)]
print(find_matches("b*st*rd", wordlist))
Prints:
['bastard']
Note: If you want to match the exact word, add $ at the end of your pattern. That way appl* will not match applejuice in your dictionary, for example.
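To apply this to a whole text, the pattern idea can be combined with re.sub. The following is a minimal sketch under a few assumptions: pat.fullmatch is used as an alternative to appending $, re.escape guards against other special characters in the censored word, and matching is case-insensitive so that B*stard finds bastard. The uncensor and restore helpers and the sample wordlist are hypothetical, not part of the answer above.

```python
import re

# Hypothetical dictionary of uncensored words.
wordlist = ["bastard", "bastards", "apple", "applejuice"]


def find_matches(censored_word, wordlist):
    # Escape every literal piece, then join the pieces with '.',
    # so each '*' stands for exactly one unknown character.
    pattern = ".".join(re.escape(part) for part in censored_word.split("*"))
    pat = re.compile(pattern, re.IGNORECASE)
    # fullmatch requires the whole word to match, like anchoring with '$'.
    return [w for w in wordlist if pat.fullmatch(w)]


def uncensor(text, wordlist):
    def restore(m):
        censored = m.group(0)
        matches = find_matches(censored, wordlist)
        if len(matches) != 1:
            return censored  # no match, or ambiguous: leave it untouched
        word = matches[0]
        # Carry over the capitalization of the first letter.
        return word.capitalize() if censored[0].isupper() else word

    # A censored word: letters and asterisks with at least one '*'.
    return re.sub(r"[A-Za-z]*\*[A-Za-z*]*", restore, text)


print(find_matches("appl*", wordlist))
print(uncensor("B*st*rds ate an appl*.", wordlist))
```

Leaving ambiguous words untouched is a deliberate choice in this sketch; with a large dictionary, a pattern such as b*t can match several words, and picking one silently could change the story's meaning.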