I need help automatically decensoring text (lots of text to be processed)

I have a web story that has censored words in it, masked with asterisks.

Right now I'm doing it with a simple, dumb str.replace,

but as you can imagine this is a pain, and I need to search the text to find every instance of the censoring.

Here are the instances of "bastard" alone: capitalized, plural, and with asterisks in different places:

toReplace = toReplace.replace("b*stard", "bastard")
toReplace = toReplace.replace("b*stards", "bastards")
toReplace = toReplace.replace("B*stard", "Bastard")
toReplace = toReplace.replace("B*stards", "Bastards")
toReplace = toReplace.replace("b*st*rd", "bastard")
toReplace = toReplace.replace("b*st*rds", "bastards")
toReplace = toReplace.replace("B*st*rd", "Bastard")
toReplace = toReplace.replace("B*st*rds", "Bastards")

Is there a way to compare every word containing "*" (or any other replacement character) to an already compiled dict and replace it with the uncensored version of the word? Maybe regex, but I don't think so.

CodePudding user response：

Using regex alone will likely not result in a full solution for this. You would likely have an easier time if you have a simple list of the words that you want to restore, and use Levenshtein distance to determine which one is closest to a given word that you have found a * in.

One library that may help with this is fuzzywuzzy.

The two approaches that I can think of quickly:

1. Split the text so that you have one string per word. For each word, if '*' in word, compare it to the list of replacements to find which is closest.
2. Use re.sub to identify the words that contain a * character, and write a function to pass as the repl argument that determines which replacement the word is closest to and returns that replacement.
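
The second approach could look roughly like the sketch below. It is only an illustration under a few assumptions: the word list, the `CENSORED` pattern and the function names are made up for the example, and it uses `difflib.get_close_matches` from the standard library as a stand-in for a Levenshtein-based library such as fuzzywuzzy, so the similarity score is difflib's ratio, not a true edit distance.

```python
import difflib
import re

# Hypothetical word list; replace it with the words you actually want to restore.
WORDLIST = ["bastard", "bastards", "apple", "orange"]

# A run of letters and asterisks that contains at least one asterisk.
CENSORED = re.compile(r"[A-Za-z]*\*[A-Za-z*]*")


def restore(match):
    """repl callback for re.sub: return the closest known word, if any."""
    word = match.group(0)
    # Compare in lowercase. cutoff=0.6 is difflib's default and only a guess
    # at a sensible threshold for this use case.
    candidates = difflib.get_close_matches(word.lower(), WORDLIST, n=1, cutoff=0.6)
    if not candidates:
        return word  # nothing close enough: leave the censored word untouched
    best = candidates[0]
    # Carry an initial capital over from the censored word.
    return best.capitalize() if word[0].isupper() else best


def decensor(text):
    return CENSORED.sub(restore, text)


print(decensor("You b*stard! Those B*st*rds ate my *pple."))
```

Because the match is fuzzy, a short or heavily masked word can be restored to the wrong entry, so it is worth spot-checking the output or raising the cutoff.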


CodePudding user response：

You can use the re module to find matches between the censored word and the words in your wordlist.

Replace * with . (the dot has a special meaning in regex: it matches any single character except a newline) and then use re.match:

import re

wordlist = ["bastard", "apple", "orange"]


def find_matches(censored_word, wordlist):
    pat = re.compile(censored_word.replace("*", "."))
    return [w for w in wordlist if pat.match(w)]


print(find_matches("b*st*rd", wordlist))

Prints:

['bastard']

Note: If you want to match the exact word, add $ at the end of your pattern (or use pat.fullmatch instead of pat.match). With that, appl* will not match applejuice in your dictionary, for example.
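
To run this over a whole text instead of one word at a time, the same idea can be plugged into re.sub. The following is a minimal sketch, not a definitive implementation: the word list, the `CENSORED` pattern, and the `restore` and `decensor` helpers are assumptions made for the example, and it assumes each * stands for exactly one letter. It also escapes the censored word with re.escape, matches case-insensitively, and uses fullmatch so the whole word has to match.

```python
import re

# Hypothetical word list; in practice this would be your compiled dictionary.
WORDLIST = ["bastard", "bastards", "apple", "applejuice"]

# A run of letters and asterisks that contains at least one asterisk.
CENSORED = re.compile(r"[A-Za-z]*\*[A-Za-z*]*")


def find_matches(censored_word, wordlist):
    # Escape the word first, then turn each escaped asterisk into "." so that
    # it matches exactly one character.
    pat = re.compile(re.escape(censored_word).replace(r"\*", "."), re.IGNORECASE)
    # fullmatch requires the whole word to match, so "appl*" cannot match "applejuice".
    return [w for w in wordlist if pat.fullmatch(w)]


def restore(match):
    word = match.group(0)
    candidates = find_matches(word, WORDLIST)
    if len(candidates) != 1:
        return word  # no match, or an ambiguous one: keep the censored form
    best = candidates[0]
    # Carry an initial capital over from the censored word.
    return best.capitalize() if word[0].isupper() else best


def decensor(text):
    return CENSORED.sub(restore, text)


print(decensor("B*st*rds! One b*stard ate the appl*."))
```

Leaving ambiguous words untouched is a deliberate choice here: a pattern like c*t can match several dictionary entries, and guessing silently would be harder to spot than a leftover asterisk.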