I am trying to create a dictionary of the words in a text and their context. The context should be the list of words that occur within a 5-word window (two words on either side) of the term's position in the string. I also want to exclude stopwords from the output vectors.
My code is below. I can get the stopwords out of my dictionary's keys but not the values.
from itertools import count

words = ["This", "is", "an", "example", "sentence"]
stopwords = ["it", "the", "was", "of"]
context_size = 2
stripes = {word: words[max(i - context_size, 0):j] for word, i, j in zip(words, count(0), count(context_size + 1)) if word.lower() not in stopwords}
print(stripes)
The output is:
{'example': ['is', 'an', 'example', 'sentence'], 'sentence': ['an', 'example', 'sentence']}
CodePudding user response:
You can remove the stopwords from words before processing:
words = ["This", "was", "an", "example", "sentence" ]
stopwords = ["it", "the", "was", "of"]
words = [w for w in words if w.lower() not in stopwords]
Output:
>>> words
['This', 'an', 'example', 'sentence']
Now you can run your comprehension on the filtered list, and no stopwords will appear in the keys or the values.
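As a minimal sketch of that idea, the filtered list can be fed to a comprehension like the one in the question. This assumes the window should be counted over the remaining non-stopwords rather than over the original positions. The name `filtered` is introduced here only for illustration, and `enumerate` stands in for the two `count` iterators.

```python
words = ["This", "was", "an", "example", "sentence"]
stopwords = {"it", "the", "was", "of"}
context_size = 2

# Drop stopwords first (case-insensitively), so they can appear
# neither as keys nor inside any context window.
filtered = [w for w in words if w.lower() not in stopwords]

# Up to context_size words on either side, including the word itself,
# as in the question's slice.
stripes = {
    word: filtered[max(i - context_size, 0):i + context_size + 1]
    for i, word in enumerate(filtered)
}
print(stripes)
```

Because the words are removed before the windows are taken, a context can reach words that were more than two positions away in the original text.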
CodePudding user response:
words = ["This", "is", "a", "longer", "example", "sentence"]
stopwords = set(["it", "the", "was", "of", "is", "a"])
context_size = 2
stripes = []
for index, word in enumerate(words):
    # Stopwords do not get an entry of their own.
    if word.lower() in stopwords:
        continue
    # Clamp the window to the bounds of the list.
    i = max(index - context_size, 0)
    j = min(index + context_size, len(words) - 1) + 1
    # Words before and after the current word, excluding the word itself.
    context = words[i:index] + words[index + 1:j]
    stripes.append((word, context))
print(stripes)
I would recommend using a list of tuples, so that if a word occurs more than once in words, a later occurrence does not overwrite the earlier ones, as it would in a dict. I would also put the stopwords in a set, especially for a larger list such as NLTK's stopwords, since membership tests on a set are faster than on a list.
I also excluded the word itself from the context, but depending on how you want to use it, you might want to include it.
This results in:
[('This', ['is', 'a']), ('longer', ['is', 'a', 'example', 'sentence']), ('example', ['a', 'longer', 'sentence']), ('sentence', ['longer', 'example'])]
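The context lists above still contain stopwords such as 'is' and 'a'. If those should be dropped from the values as well, one possible variation is to filter each context after slicing. This is only a sketch, assuming the window is still measured on the original word positions; the function name `build_stripes` is hypothetical and not part of the answer above.

```python
def build_stripes(words, stopwords, context_size=2):
    """Return (word, context) pairs with stopwords removed from both."""
    stripes = []
    for index, word in enumerate(words):
        # Stopwords do not get an entry of their own.
        if word.lower() in stopwords:
            continue
        start = max(index - context_size, 0)
        # Neighbours on both sides, excluding the word itself.
        window = words[start:index] + words[index + 1:index + context_size + 1]
        # Remove stopwords from the context after the window is taken.
        context = [w for w in window if w.lower() not in stopwords]
        stripes.append((word, context))
    return stripes


words = ["This", "is", "a", "longer", "example", "sentence"]
stopwords = {"it", "the", "was", "of", "is", "a"}
print(build_stripes(words, stopwords))
```

With this approach a context can be shorter than the window, or empty, when its neighbours are all stopwords.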