I am trying to remove all the English words from a list of 1000 lines that all follow the same pattern:
Englishword Vietnamese word \t
...
Word từ \t
...
Young trẻ tuổi \t this is the last line
I tried:
dix_words = """Ability có khả năng
About Về
Above ở trên
Abuse lạm dụng
Accept Chấp nhận
Access tới gần
Achieve Hoàn thành
Acknowledge thừa nhận
Acquire giành được
Across băng qua"""
lines = dix_words
list_of_words = lines.splitlines()
print(list_of_words)
list_of_vn_words = "" # create new final string
for word in list_of_words:
    word_vn = re.sub(r'.*\t$', '', word) # create new word_vn
    list_of_vn_words = list_of_vn_words.append(word_vn) # create new string
My regex is supposed to replace every character (.*) before the final \t ($) with nothing.
Evidently the regex and I are not on the same wavelength, because my word_vn comes out identical to word.
I will find a way around append, which doesn't work on strings...
CodePudding user response:
Your logic is on the right track; the problem is that the separator between the words is ordinary whitespace (\s), not a tab (\t), so the pattern never matches and re.sub returns the line unchanged. Here is the working code:
import re
dix_words = """Ability có khả năng
About Về
Above ở trên
Abuse lạm dụng
Accept Chấp nhận
Access tới gần
Achieve Hoàn thành
Acknowledge thừa nhận
Acquire giành được
Across băng qua"""
lines = dix_words
list_of_words = lines.splitlines()
print(list_of_words)
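list_of_vn_words = []  # final list (a list, not a string, so append works)
for word in list_of_words:
    # original attempt: only changes the line if it ends with a tab
    word_vn = re.sub(r'.*\t$', '', word)
    if word_vn == word:
        # no trailing tab: drop everything up to a run of four whitespace characters
        word_vn = re.sub(r'.*\s{4}', '', word)
    list_of_vn_words.append(word_vn)  # append modifies the list in place and returns None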
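If the English headword is always a single token with no spaces in it (an assumption, though it holds for every sample line shown here), the same result can be sketched without a regex by splitting each line once on whitespace. The helper name vietnamese_part is made up for this illustration, and the three-line sample is just a shortened copy of the data above:

```python
# Sample lines: one English headword, whitespace, then the Vietnamese translation.
dix_words = """Ability có khả năng
About Về
Above ở trên"""


def vietnamese_part(line):
    """Return the text after the first whitespace-separated token.

    Assumes the English headword is a single token (hypothetical helper).
    """
    # strip() drops a trailing tab or spaces; split(maxsplit=1) cuts only once,
    # so spaces inside the Vietnamese phrase are preserved.
    parts = line.strip().split(maxsplit=1)
    return parts[1] if len(parts) == 2 else ""


list_of_vn_words = [vietnamese_part(line) for line in dix_words.splitlines()]
print(list_of_vn_words)  # ['có khả năng', 'Về', 'ở trên']
```

Because split() with no separator treats any run of spaces or tabs as one delimiter, this does not depend on whether the columns are separated by one space, four spaces, or a tab. It would give wrong results for a multi-word English entry, where a regex or a fixed delimiter would be needed instead.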