I think regex is the best solution here, because when I try this:
forbidden_words = ["sex", "porn", "dick", "drug", "casino", "gambling"]
def censor(string):
    # Remove line breaks and make it lowercase
    string = " ".join(string.splitlines()).lower()
    for word in forbidden_words:
        if word in string:
            string = string.replace(word, '*' * len(word))
            print(f"Forbidden word REMOVED: {word}")
    return string
print(censor("Sex, pornography, and Dicky are ALL not allowed."))
It returns everything in lowercase, but I don't want to convert the text to lowercase:
***, ****ography, and ****y are all not allowed.
I want my Python code to return:
***, ****ography, and ****y are ALL not allowed.
My Regex below returns:
***, pornography, and Dicky are ALL not allowed.
My Regex code:
import re
forbidden_words = ["sex", "porn", "dick", "drug", "casino", "gambling"]
def censor(string):
    # Remove line breaks
    string = " ".join(string.splitlines())
    for word in forbidden_words:
        # Use a regular expression to search for the word, ignoring case
        pattern = r"\b{}\b".format(word)
        if re.search(pattern, string, re.IGNORECASE):
            string = re.sub(pattern, '*' * len(word), string, flags=re.IGNORECASE)
            print(f"Forbidden word REMOVED: {word}")
    return string
print(censor("Sex, pornography, and Dicky are ALL not allowed."))
Also, is regex the best solution here? I feel like I am writing a lot of unnecessary code. Sorry, I am new to Python. Thanks.
CodePudding user response:
You can join the forbidden words with | into a single compiled regex and use the ignore-case flag:
import re
forbidden_words = ["sex", "porn", "dick", "drug", "casino", "gambling"]
pat = re.compile("|".join(re.escape(w) for w in forbidden_words), flags=re.I)
def censor(s):
    return pat.sub(lambda g: "*" * len(g.group(0)), s)
print(censor("Sex, pornography, and Dicky are ALL not allowed."))
Prints:
***, ****ography, and ****y are ALL not allowed.
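As a side note on why the regex in the question only caught "Sex": \b on both sides of a word makes the pattern match that word only when it stands alone, so "porn" inside "pornography" is skipped. The sketch below compares the two behaviours side by side. The names whole_word, anywhere and mask are made up for this illustration and are not part of the code above.

```python
import re

forbidden_words = ["sex", "porn", "dick", "drug", "casino", "gambling"]
alternation = "|".join(re.escape(w) for w in forbidden_words)

# Hypothetical names for illustration only.
# \b on both sides: matches a forbidden word only when it is a whole word.
whole_word = re.compile(r"\b(?:" + alternation + r")\b", flags=re.IGNORECASE)
# No boundaries: matches a forbidden word anywhere, even inside a longer word.
anywhere = re.compile(alternation, flags=re.IGNORECASE)

def mask(match):
    # Replace the matched text with the same number of asterisks.
    return "*" * len(match.group(0))

text = "Sex, pornography, and Dicky are ALL not allowed."
print(whole_word.sub(mask, text))  # ***, pornography, and Dicky are ALL not allowed.
print(anywhere.sub(mask, text))    # ***, ****ography, and ****y are ALL not allowed.
```

Which variant is appropriate depends on the goal: matching anywhere also masks harmless words that merely contain a forbidden substring, while whole-word matching misses derived forms such as "pornography".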
CodePudding user response:
You are almost there yourself! Just a small modification to your code can get you the behavior that you need. Right now your code is making edits to the complete sentence if a keyword is found, but you can do this operation at a token level instead. That will allow you to control which tokens remain untouched and which of them are modified.
Here is a fix to your existing code that will work as expected.
forbidden_words = ["sex", "porn", "dick", "drug", "casino", "gambling"]
def censor(token):
    # if word not found, the token is just passed through as is.
    for word in forbidden_words:
        if word in token.lower(): #<---
            token = token.lower().replace(word, '*' * len(word)) #<---
            print(f"Forbidden word REMOVED: {word}")
    return token
text = "Sex, pornography, and Dicky are ALL not allowed."
result = ' '.join([censor(i) for i in text.split()]) #<---
print(result)
Prints:
Forbidden word REMOVED: sex
Forbidden word REMOVED: porn
Forbidden word REMOVED: dick
***, ****ography, and ****y are ALL not allowed.
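One caveat with the token-level version: token.lower().replace(...) lowercases the whole token whenever a forbidden word is found in it, so a token like "DICKY" would come back as "****y" rather than "****Y". The sample sentence happens not to show this. A possible variation, sketched here under the assumption that the rest of the token should keep its case, uses re.subn on each token. The name censor_token is hypothetical.

```python
import re

forbidden_words = ["sex", "porn", "dick", "drug", "casino", "gambling"]

def censor_token(token):
    """Mask forbidden words in one token without changing the case of the rest."""
    for word in forbidden_words:
        # re.subn returns the new string and the number of replacements made;
        # re.escape guards against regex metacharacters in a forbidden word.
        token, count = re.subn(re.escape(word), "*" * len(word), token,
                               flags=re.IGNORECASE)
        if count:
            print(f"Forbidden word REMOVED: {word}")
    return token

text = "DICKY and SEXY are NOT allowed."
# Note: split() followed by ' '.join() collapses runs of whitespace and line breaks.
result = " ".join(censor_token(t) for t in text.split())
print(result)  # ****Y and ***Y are NOT allowed.
```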