Removing a custom list of stopwords for an NLP task

I have written a function to clean my text corpus, which is of the following form:

["wild things is a suspenseful .. twists .  ",
 "i know it already.. film goers .  ",
.....,
"touchstone pictures..about it .  okay ?  "]

It is a list of strings, one per sentence.

My function is:

import re

def clean_sentences(sentences):

    sentences = (re.sub(r'\d+', '£', s) for s in sentences)
 
    stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is' , 'it']
       
    sentences = ' '.join(w for w in sentences if w not in stopwords)

    return sentences

It replaces the numbers with '£' but it does not remove the stopwords.

Output:

'wild things is a suspenseful thriller...

and a £ . £ rating , it\'s still watchable , just don\'t think about it .  okay ?  '

I don't understand why. Thank you.

CodePudding user response：

I believe it's because you used a regex to replace digits with the symbol £. For clarification:

sentences = (re.sub(r'\d+', '£', s) for s in sentences)

This line replaces every run of digits with that symbol. I see that you then define your list of stopwords and build a new result without those stopwords. However, the symbol £ is not in the list of stopwords, so it won't be excluded from the result. You could try adding it to your list of stopwords like so:

import re

def clean_sentences(sentences):

    sentences = (re.sub(r'\d+', '£', s) for s in sentences)
 
    stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is' , 'it', '£']
       
    sentences = ' '.join(w for w in sentences if w not in stopwords)

    return sentences

Hope this helps!

CodePudding user response：

You compare the whole sentences to the stopwords when you actually want to compare words within the sentences to the stopwords.
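
A quick way to see this, as a minimal sketch using two of the sample sentences from the question (the name kept is only for illustration): iterating over the list yields whole sentence strings, and none of them is equal to a stopword, so the filter keeps everything.

```python
sentences = ["wild things is a suspenseful .. twists .  ",
             "i know it already.. film goers .  "]

stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is', 'it']

# Each w here is an entire sentence string, not a single word,
# so "w not in stopwords" is True for every element.
kept = [w for w in sentences if w not in stopwords]

print(kept == sentences)  # True: nothing was filtered out
```

The setup for the fix: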

import re

sentences = ["wild things is a suspenseful .. twists .  ",
             "i know it already.. film goers .  ",
             "touchstone pictures..about it .  okay ?  "]

stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is', 'it']

As a loop:

new_sentences = []
for sentence in sentences:
    new_sentence = sentence.split()
    new_sentence = [re.sub(r'\d+', '£', word) for word in new_sentence]
    new_sentence = [word for word in new_sentence if word not in stopwords]
    new_sentence = " ".join(new_sentence)
    new_sentences.append(new_sentence)

Or, much more compact, as a list comprehension:

new_sentences = [
    " ".join(re.sub(r'\d+', '£', word) for word in sentence.split() if word not in stopwords)
    for sentence in sentences
]

Both versions give:

print(new_sentences)
> ['wild things suspenseful .. twists .', 'i know already.. film goers .', 'touchstone pictures..about . okay ?']
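
If you want to keep the original function shape, the same idea could be wrapped up roughly like this. It is only a sketch under a few assumptions of mine: the function returns a list of cleaned sentences rather than one joined string, the stopwords are stored in a set for faster lookups, and the names STOPWORDS and DIGITS are made up for the example.

```python
import re

# Same stopwords as in the question, stored as a set (an assumption, not required).
STOPWORDS = {'a', 'and', 'any', 'he', 'her', 'here', 'hers',
             'herself', 'him', 'himself', 'is', 'it'}

# Matches one or more consecutive digits.
DIGITS = re.compile(r'\d+')


def clean_sentences(sentences, stopwords=STOPWORDS):
    """Replace digit runs with '£' and drop stopwords, sentence by sentence."""
    cleaned = []
    for sentence in sentences:
        # Split on whitespace so the comparison is word by word.
        words = (DIGITS.sub('£', word) for word in sentence.split())
        cleaned.append(' '.join(w for w in words if w not in stopwords))
    return cleaned


print(clean_sentences(["it got a 2 . 5 rating , okay ?  "]))
# ['got £ . £ rating , okay ?']
```

Note that splitting on whitespace only removes tokens that match a stopword exactly, so something like "it's" stays in the output.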