Remove punctuation marks from tokenized text using a for loop


I'm trying to remove punctuation marks from a tokenized text in Python like so:

import nltk

word_tokens = nltk.word_tokenize(text)
w = word_tokens
for e in word_tokens:
    if e in punctuation_marks:
        w.remove(e)

This works somewhat: I manage to remove a lot of the punctuation marks, but for some reason many of them are still left in word_tokens. If I run the code a second time, it removes some more of them. After running the same code three times, all the marks are removed. Why does this happen?

It doesn't seem to matter whether punctuation_marks is a list, a string, or a dictionary. I've also tried iterating over word_tokens.copy(), which does a bit better: it removes almost all the marks the first time and all of them the second time. Is there a simple way to fix this so that running the code once is sufficient?

CodePudding user response:

You are removing elements from the same list that you are iterating over. It seems you are aware of the potential problem; that's why you added the line:

w = word_tokens

However, that line doesn't actually create a copy of the object referenced by word_tokens; it only makes w reference the same object. To create a copy, you can use the slicing operator, replacing the line above with:

w = word_tokens[:]
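
Putting it together, a minimal sketch of the one-pass fix (assuming punctuation_marks is defined as in the question):

import nltk

word_tokens = nltk.word_tokenize(text)
w = word_tokens[:]          # shallow copy, so removing from w doesn't affect the iteration
for e in word_tokens:       # iterate over the original, untouched list
    if e in punctuation_marks:
        w.remove(e)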

CodePudding user response:

Why don't you add the tokens that are not punctuation to a new list instead?

import nltk

word_tokens = nltk.word_tokenize(text)
w = list()
for e in word_tokens:
    if e not in punctuation_marks:
        w.append(e)
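
Equivalently, the same filter written as a list comprehension:

w = [e for e in word_tokens if e not in punctuation_marks]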

Suggestion: since you are creating word tokens, I would recommend removing the punctuation before tokenizing the text. You can use the str.translate method together with the string.punctuation constant from the standard library's string module.

# Import the libraries
import string
import nltk

# Build a translation table that deletes every punctuation character
tr = str.maketrans("", "", string.punctuation)

# Remove punctuation
text = text.translate(tr)

# Get the word tokens
word_tokens = nltk.word_tokenize(text)

If you want sentence tokenization instead, you can do something like the following:

from nltk.tokenize import sent_tokenize

texts = sent_tokenize(text)
for i in range(len(texts)):
    texts[i] = texts[i].translate(tr)
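
A quick sanity check of the above, assuming the translation table tr built earlier (sent_tokenize needs the NLTK Punkt model, e.g. via nltk.download('punkt')):

import string
from nltk.tokenize import sent_tokenize

tr = str.maketrans("", "", string.punctuation)
text = "Hello, world! How are you today?"
texts = [s.translate(tr) for s in sent_tokenize(text)]
print(texts)  # ['Hello world', 'How are you today']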

CodePudding user response:

I suggest you try regex, appending the results to a new list instead of manipulating word_tokens directly:

import re

word_tokens = nltk.word_tokenize(text)
w_ = list()
for e in word_tokens:
    cleaned = re.sub(r'[.!?\-]', '', e)  # strip these characters from the token
    if cleaned:  # drop tokens that were pure punctuation
        w_.append(cleaned)
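
Note that the character class above only strips ., !, ? and -. If the goal is to drop all ASCII punctuation, one option (an addition, not from the original answer) is to build the class from string.punctuation:

import re
import string

# character class matching any ASCII punctuation character
punct_class = '[' + re.escape(string.punctuation) + ']'
print(re.sub(punct_class, '', "don't-stop!"))  # dontstop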

CodePudding user response:

You are modifying the actual word_tokens list while looping over it, which is the problem.

For instance, say you have something like A?!B, tokenized so that the indexes are A:0, ?:1, !:2, B:3. The for loop keeps an internal counter (say i) that increases on each iteration. When you remove the ? (at i=1), the elements shift back (the new indexes are A:0, !:1, B:2), and the counter still increments to i=2, so the loop never looks at the ! character.

It's best not to mutate the original list; simply copy the elements you want into a new one.
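
A minimal demonstration of the skip described above:

tokens = ['A', '?', '!', 'B']
for t in tokens:
    if t in '?!':
        tokens.remove(t)
print(tokens)  # ['A', '!', 'B'] -- the '!' was skipped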
