How to censor words in Python?

I think regex is the best solution here, because when I try this:

forbidden_words = ["sex", "porn", "dick", "drug", "casino", "gambling"]
def censor(string):
    # Remove line breaks and make it lowercase
    string = " ".join(string.splitlines()).lower()
    for word in forbidden_words:
        if word in string:
            string = string.replace(word, '*' * len(word))
            print(f"Forbidden word REMOVED: {word}")
    return string
print(censor("Sex, pornography, and Dicky are ALL not allowed."))

It returns everything in lowercase, but I don't want to convert the whole string to lowercase:

***, ****ography, and ****y are all not allowed.

I want my Python code to return:

***, ****ography, and ****y are ALL not allowed.

My regex version below returns:

***, pornography, and Dicky are ALL not allowed.

My regex code:

import re

forbidden_words = ["sex", "porn", "dick", "drug", "casino", "gambling"]

def censor(string):
    # Remove line breaks
    string = " ".join(string.splitlines())
    for word in forbidden_words:
        # Use a regular expression to search for the word, ignoring case
        pattern = r"\b{}\b".format(word)
        if re.search(pattern, string, re.IGNORECASE):
            string = re.sub(pattern, '*' * len(word), string, flags=re.IGNORECASE)
            print(f"Forbidden word REMOVED: {word}")
    return string

print(censor("Sex, pornography, and Dicky are ALL not allowed."))

Also, is regex the best solution here? I feel like I am writing a lot of unnecessary code. Sorry, I am new to Python. Thanks.

CodePudding user response：

You can join the words into a single pattern with | (alternation), compile it once, and use the ignore-case flag (re.I):

import re

forbidden_words = ["sex", "porn", "dick", "drug", "casino", "gambling"]

pat = re.compile("|".join(re.escape(w) for w in forbidden_words), flags=re.I)


def censor(s):
    return pat.sub(lambda g: "*" * len(g.group(0)), s)


print(censor("Sex, pornography, and Dicky are ALL not allowed."))

Prints:

***, ****ography, and ****y are ALL not allowed.
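As for why the regex attempt in the question only censored "Sex": its pattern wraps each word in \b (word boundary), so it matches whole words only, and "porn" inside "pornography" is skipped. Here is a small sketch comparing the two behaviours. The names alternation, substring_pat, whole_word_pat and mask are just illustrative choices, not anything from the question.

```python
import re

forbidden_words = ["sex", "porn", "dick", "drug", "casino", "gambling"]
alternation = "|".join(re.escape(w) for w in forbidden_words)

# Matches a forbidden word anywhere, including inside a longer word.
substring_pat = re.compile(alternation, flags=re.IGNORECASE)
# Matches a forbidden word only when it stands alone as a whole word.
whole_word_pat = re.compile(rf"\b(?:{alternation})\b", flags=re.IGNORECASE)


def mask(match):
    # Replace the matched text with the same number of asterisks.
    return "*" * len(match.group(0))


text = "Sex, pornography, and Dicky are ALL not allowed."
print(substring_pat.sub(mask, text))   # ***, ****ography, and ****y are ALL not allowed.
print(whole_word_pat.sub(mask, text))  # ***, pornography, and Dicky are ALL not allowed.
```

So dropping \b gives the output asked for in the question, while keeping it is probably the better choice if harmless words that merely contain a forbidden word should be left alone.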

CodePudding user response：

You are almost there yourself! Just a small modification to your code can get you the behavior that you need. Right now your code is making edits to the complete sentence if a keyword is found, but you can do this operation at a token level instead. That will allow you to control which tokens remain untouched and which of them are modified.

Here is a fix to your existing code that will work as expected.

forbidden_words = ["sex", "porn", "dick", "drug", "casino", "gambling"]


def censor(token):

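    # If no forbidden word is found, the token is passed through unchanged.

    for word in forbidden_words:
        # Search a lowercased copy, but splice the original token so the
        # rest of it keeps its case (e.g. "DrugStore" -> "****Store").
        start = token.lower().find(word)                         #<---
        while start != -1:                                       #<---
            token = token[:start] + '*' * len(word) + token[start + len(word):]
            print(f"Forbidden word REMOVED: {word}")
            start = token.lower().find(word, start + len(word))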
    
    return token


text = "Sex, pornography, and Dicky are ALL not allowed."
result = ' '.join([censor(i) for i in text.split()])             #<---
print(result)

Forbidden word REMOVED: sex
Forbidden word REMOVED: porn
Forbidden word REMOVED: dick

***, ****ography, and ****y are ALL not allowed.
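If you want the same "Forbidden word REMOVED" lines without splitting the text into tokens, one possible sketch is to let the replacement function of a compiled pattern collect what it matched. The helper name censor_with_report and the found list are assumptions made for this example, not part of either answer.

```python
import re

forbidden_words = ["sex", "porn", "dick", "drug", "casino", "gambling"]
pat = re.compile("|".join(re.escape(w) for w in forbidden_words), flags=re.IGNORECASE)


def censor_with_report(text):
    """Return the censored text and the forbidden words found, in order."""
    found = []

    def mask(match):
        # Record the matched word in lowercase, then mask it with asterisks.
        found.append(match.group(0).lower())
        return "*" * len(match.group(0))

    return pat.sub(mask, text), found


censored, found = censor_with_report("Sex, pornography, and Dicky are ALL not allowed.")
for word in found:
    print(f"Forbidden word REMOVED: {word}")
print(censored)
```

Returning the list instead of printing inside the function leaves it up to the caller whether to log, count, or ignore the matches.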