I'm trying to remove punctuation from a tokenized text in Python like so:
word_tokens = nltk.word_tokenize(text)
w = word_tokens
for e in word_tokens:
    if e in punctuation_marks:
        w.remove(e)
This works somewhat: I manage to remove a lot of the punctuation marks, but for some reason many of them are still left in word_tokens. If I run the code a second time, it removes some more of them. After running the same code three times, all the marks are gone. Why does this happen?
It doesn't seem to matter whether punctuation_marks is a list, a string or a dictionary. I've also tried iterating over word_tokens.copy(), which does a bit better: it removes almost all marks the first time, and all of them the second time. Is there a simple way to fix this so that running the code once is sufficient?
CodePudding user response:
You are removing elements from the same list that you are iterating over. It seems you are aware of the potential problem; that's why you added the line:
w = word_tokens
However, that line doesn't actually create a copy of the object referenced by word_tokens; it only makes w reference the same object. To create a copy, you can use the slicing operator, replacing the line above with:
w = word_tokens[:]
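With that change the loop iterates over the original list while removing from the copy, so nothing gets skipped. A minimal sketch of the full fix (assuming punctuation_marks is defined as in the question):
import nltk

word_tokens = nltk.word_tokenize(text)
w = word_tokens[:]  # shallow copy; removing from w no longer disturbs the iteration
for e in word_tokens:
    if e in punctuation_marks:
        w.remove(e)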
CodePudding user response:
Why don't you append the tokens that are not punctuation marks instead?
word_tokens = nltk.word_tokenize(text)
w = list()
for e in word_tokens:
    if e not in punctuation_marks:
        w.append(e)
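The same filter can also be written as a list comprehension, which is the more idiomatic form:
w = [e for e in word_tokens if e not in punctuation_marks]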
Suggestion: I see you are creating word tokens. If that's the case, I would suggest you remove the punctuation before tokenizing the text. You can use the built-in str.translate method together with string.punctuation from the string library:
# Import the library
import string
# Initialize the translate to remove punctuations
tr = str.maketrans("", "", string.punctuation)
# Remove punctuations
text = text.translate(tr)
# Get the word tokens
word_tokens = nltk.word_tokenize(text)
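For example, with text = "Hello, world! How are you?", the translate call turns it into "Hello world How are you", so no punctuation ever reaches the tokenizer.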
If you want to do sentence tokenization, you can do something like this:
from nltk.tokenize import sent_tokenize
texts = sent_tokenize(text)
for i in range(len(texts)):
    texts[i] = texts[i].translate(tr)
CodePudding user response:
I suggest you use a regex and append the results to a new list instead of manipulating word_tokens directly. Here's one:
import re

word_tokens = nltk.word_tokenize(text)
w_ = list()
for e in word_tokens:
    w_.append(re.sub(r'[.!?\-]', '', e))
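Note that tokens consisting only of punctuation become empty strings with this approach, so you may want to filter them out afterwards:
w_ = [t for t in w_ if t]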
CodePudding user response:
You are modifying the actual word_tokens list, which is wrong.
For instance, say you have something like A?!B, where the tokens are indexed as A:0, ?:1, !:2, B:3. Your for loop has an internal counter (say i) that increases on each iteration. Say you remove the ? (meaning i=1): the list indexes shift back (the new indexes are A:0, !:1, B:2) while your counter increments (i=2). So you missed the ! character here!
Best not to mess with the original list; simply copy the results to a new one.
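A quick demonstration of the skipping behaviour (a minimal sketch with a hand-built token list):
tokens = ['A', '?', '!', 'B']
for e in tokens:
    if e in ('?', '!'):
        tokens.remove(e)  # shifts the remaining elements left while iterating
print(tokens)  # ['A', '!', 'B'] -- the '!' was skipped, exactly as described above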