I have a question. I want a regular expression that finds all Latin words except a few special ones. That is, I want to delete every Latin word from my text except "@USER", "@USERS", and "http". For example, this sentence:
"Hello سلام. من حسین هستم. @USER @USERS http this is a good topic."
should become:
"سلام. من حسین هستم. @USER @USERS http."
I tried this code but it doesn't work.
import re

def remove_eng(sents):
    new_list = []
    for string in sents:
        string = str(string)
        new_list.append(' '.join(re.sub(r'^[a-zA-Z].*[a-zA-Z].?$', r'', w)
                                 for w in string.split()))
    return new_list
And the output is:
[' سلام. من حسین هستم. @USER @USERS a ']
I don't know how to exclude '@USER', '@USERS', and 'http'. Could anyone help me? Thanks.
CodePudding user response:
Use a normal for-loop instead of a list comprehension; then you can use if/else to exclude words before you apply the regex.
import re

def remove_eng(sents):
    new_list = []
    for old_string in sents:
        old_words = old_string.split()
        new_words = []
        for word in old_words:
            if word in ('@USER', '@USERS', 'http'):
                # protected words are kept unchanged
                new_words.append(word)
            else:
                # strip a leading run of Latin letters and . ? !
                result = re.sub(r'^[a-zA-Z.?!]+', r'', word)
                #print(result)
                if result:  # skip empty
                    new_words.append(result)
        new_string = " ".join(new_words)
        new_list.append(new_string)
    return new_list

# --- main ---
data = ["Hello سلام. من حسین هستم. @USER @USERS http this is a good topic."]
print(remove_eng(data))
Result:
['سلام. من حسین هستم. @USER @USERS http']
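The same idea can be written more compactly. This is only a sketch, not part of the answer above: the names KEEP, LATIN_WORD and remove_latin_words are made up for illustration. It assumes a "Latin word" is a whitespace-separated token made only of ASCII letters, optionally followed by punctuation, and it drops such tokens whole with re.fullmatch instead of stripping a prefix.

```python
import re

# Hypothetical names for illustration; not from the original post.
KEEP = {"@USER", "@USERS", "http"}

# Assumption: a Latin word is ASCII letters plus optional trailing punctuation.
LATIN_WORD = re.compile(r"[A-Za-z]+[.?!,;:]*")

def remove_latin_words(sentences, keep=KEEP):
    cleaned = []
    for sentence in sentences:
        # Keep protected tokens and every token that is not purely Latin.
        words = [
            w for w in str(sentence).split()
            if w in keep or not LATIN_WORD.fullmatch(w)
        ]
        cleaned.append(" ".join(words))
    return cleaned

data = ["Hello سلام. من حسین هستم. @USER @USERS http this is a good topic."]
print(remove_latin_words(data))
```

Because fullmatch has to match the entire token, a token that mixes Latin and non-Latin characters is left untouched here, whereas the prefix-stripping regex above would trim its leading Latin part. Which behaviour is right depends on your data.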