I have written a function to clean my text corpus, which is of the following form:
["wild things is a suspenseful .. twists . ",
"i know it already.. film goers . ",
.....,
"touchstone pictures..about it . okay ? "]
It is a list of strings, one sentence per element.
My function is:
def clean_sentences(sentences):
    sentences = (re.sub(r'\d+', '£', s) for s in sentences)
    stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is', 'it']
    sentences = ' '.join(w for w in sentences if w not in stopwords)
    return sentences
It replaces the numbers with '£' but it does not remove the stopwords.
Output:
'wild things is a suspenseful thriller...
and a £ . £ rating , it\'s still watchable , just don\'t think about it . okay ? '
I don't understand why. Thank you.
CodePudding user response:
I believe it's because you used a regex to replace digits with the symbol £ in your code. For clarification, this line replaces every run of digits with that symbol:
    sentences = (re.sub(r'\d+', '£', s) for s in sentences)
I see that you define your list of stopwords and then build a new result without those stopwords. However, the symbol £ that replaced your numbers is not in the list of stopwords, so it won't be excluded from the result. You could try adding it to your list of stopwords like so:
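import re

def clean_sentences(sentences):
    # Replace every run of digits with '£'
    sentences = (re.sub(r'\d+', '£', s) for s in sentences)
    stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is', 'it', '£']
    # Split each sentence into words so that single tokens are compared with the stopwords
    sentences = ' '.join(w for s in sentences for w in s.split() if w not in stopwords)
    return sentences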
Hope this helps!
CodePudding user response:
You compare the whole sentences to the stopwords when you actually want to compare words within the sentences to the stopwords.
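To make that concrete, here is a minimal sketch using a shortened version of the corpus and stopword list from the question (the names `kept_sentences` and `kept_words` are only illustrative). Iterating over the list yields whole sentences, and no whole sentence is equal to a stopword, so the filter never removes anything:

```python
sentences = ["wild things is a suspenseful .. twists . ",
             "i know it already.. film goers . "]
stopwords = ['a', 'and', 'is', 'it']

# Iterating over the list yields whole sentences; none of them equals a
# stopword, so every sentence passes the filter unchanged.
kept_sentences = [s for s in sentences if s not in stopwords]
print(kept_sentences == sentences)  # True

# Splitting first yields individual words, which can match stopwords.
kept_words = [w for w in sentences[0].split() if w not in stopwords]
print(kept_words)  # ['wild', 'things', 'suspenseful', '..', 'twists', '.']
```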
import re
sentences = ["wild things is a suspenseful .. twists . ",
             "i know it already.. film goers . ",
             "touchstone pictures..about it . okay ? "]
stopwords = ['a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'is', 'it']
As a loop:
new_sentences = []
for sentence in sentences:
    # Split the sentence into words so each one can be checked individually
    new_sentence = sentence.split()
    new_sentence = [re.sub(r'\d+', '£', word) for word in new_sentence]
    new_sentence = [word for word in new_sentence if word not in stopwords]
    new_sentence = " ".join(new_sentence)
    new_sentences.append(new_sentence)
Or, much more compactly, as a list comprehension:
new_sentences = [" ".join([re.sub(r'\d+', '£', word) for word in sentence.split() if word not in stopwords]) for sentence in sentences]
Both versions produce:
print(new_sentences)
> ['wild things suspenseful .. twists .', 'i know already.. film goers .', 'touchstone pictures..about . okay ?']
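If you want to keep the function form from the question, the same idea could be wrapped up roughly as follows. This is only a sketch: it assumes the pattern is meant to be `\d+` (one or more digits), it returns one cleaned string per sentence rather than a single joined string, and the `STOPWORDS` set and the sample input in the last line are made up for illustration.

```python
import re

# Assumption: a set instead of a list, so membership tests are fast.
STOPWORDS = {'a', 'and', 'any', 'he', 'her', 'here', 'hers', 'herself',
             'him', 'himself', 'is', 'it'}

def clean_sentences(sentences, stopwords=STOPWORDS):
    """Return one cleaned string per input sentence."""
    cleaned = []
    for sentence in sentences:
        # Drop stopwords word by word, then replace each run of digits with '£'.
        words = [re.sub(r'\d+', '£', w)
                 for w in sentence.split() if w not in stopwords]
        cleaned.append(' '.join(words))
    return cleaned

# Made-up input, only to show both steps at once.
print(clean_sentences(["it has a 2 . 5 rating , okay ? "]))
# ['has £ . £ rating , okay ?']
```

Note that only whitespace-separated tokens are compared, so a stopword glued to punctuation (for example `it..`) would not be removed.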