I have a list of many (18618) strings. I split the text into words by tokenizing it with the NLTK library, but now each word, as shown below, has an extra leading apostrophe, and in some places an "â" character. How can I delete all of these?
I tried removing them with a for loop but couldn't get it to work. What else can I do to solve this problem?
["'heart", "'darkness", "'nellie", "'cruising", "'yawl", ....................................]
CodePudding user response:
lst = ["'heart", "'darkness", "'nellie", "'cruising", "'yawl"]
# for every txt in lst, remove the ' and â characters
new_lst = [txt.replace("'", "").replace("â", "") for txt in lst]
CodePudding user response:
To prevent this in the first place, you may try decoding the text before tokenizing. Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).
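As a rough sketch of that approach (the filename and encoding here are assumptions; adjust them to your source file), read the raw bytes, decode them to a Unicode string, and only then tokenize:

from nltk.tokenize import word_tokenize  # requires the NLTK punkt tokenizer models

# Read the file as raw bytes, then decode explicitly so curly quotes and
# other non-ASCII characters are not mangled into sequences like "â€˜".
with open("heart_of_darkness.txt", "rb") as f:  # hypothetical filename
    raw = f.read()

text = raw.decode("utf8")      # decode before tokenizing
tokens = word_tokenize(text)   # tokenize the decoded Unicode string

If the characters still appear after decoding, the cleanup with str.replace shown in the first answer can be applied to the resulting tokens.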