How to exclude some words from text with a regular expression?

I want a regular expression that finds all Latin-alphabet (English) words except a few special ones. That is, I want to delete every Latin-alphabet word from my text except "@USER", "@USERS", and "http". For example, this sentence:

"Hello سلام. من حسین هستم. @USER @USERS http this is a good topic."

should become:

"سلام. من حسین هستم. @USER @USERS http."

I tried this code, but it doesn't work:

import re

def remove_eng(sents):
    new_list = []
    for string in sents:
        string = str(string)
        new_list.append(' '.join(re.sub(r'^[a-zA-Z].*[a-zA-Z].?$', r'', w)
                                 for w in string.split()))
    return new_list

The output is:

[' سلام. من حسین هستم. @USER @USERS    a  ']

I don't know how to exclude '@USER', '@USERS', and 'http'. Could anyone help me? Thanks.

CodePudding user response:

Use a normal for-loop instead of a generator expression. Then you can use if/else to keep the excluded words as they are and apply the regex only to the remaining words:

import re

def remove_eng(sents):
    new_list = []

    for old_string in sents:
        old_words = old_string.split()
        new_words = []
        
        for word in old_words:
            if word in ('@USER', '@USERS', 'http'):
                new_words.append(word)    
            else:
                # Strip a leading run of Latin letters and . ? ! from the word
                result = re.sub(r'^[a-zA-Z.?!]+', r'', word)
                if result:  # skip words that became empty
                    new_words.append(result)
                    
        new_string = " ".join(new_words)
        new_list.append(new_string)

    return new_list

# --- main ---

data = ["Hello سلام. من حسین هستم. @USER @USERS http this is a good topic."]

print(remove_eng(data))

Result:

['سلام. من حسین هستم. @USER @USERS http']
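
Since the question asked for a single regular expression, here is a rough sketch of how the same exclusion could be written with a negative lookbehind and a negative lookahead instead of an if/else. The names LATIN_WORD, remove_latin_words and remove_eng_regex are made up for this example. Note one assumption that differs from the loop above: the lookbehind protects every word that follows an "@", not only @USER and @USERS.

```python
import re

# Hypothetical pattern for illustration:
#   (?<![@\w])   the match must not start right after "@" or a word character,
#                so "@USER" and "@USERS" are never touched
#   (?!http\b)   the match must not be exactly the word "http"
#   [A-Za-z]+\b  a whole run of ASCII letters ending at a word boundary
#   [.?!]*       plus any punctuation attached to the end of that word
LATIN_WORD = re.compile(r"(?<![@\w])(?!http\b)[A-Za-z]+\b[.?!]*")

def remove_latin_words(text):
    # Delete the matched words, then collapse the leftover runs of spaces.
    stripped = LATIN_WORD.sub("", text)
    return " ".join(stripped.split())

def remove_eng_regex(sents):
    return [remove_latin_words(str(s)) for s in sents]

data = ["Hello سلام. من حسین هستم. @USER @USERS http this is a good topic."]
print(remove_eng_regex(data))
```

Because the letters must end at a word boundary, a mixed token such as "abc123" is left alone rather than being cut down to "123".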