Create dictionary of context words without stopwords

Time:02-13

I am trying to create a dictionary that maps each word in a text to its context. The context should be the list of words that occur within a 5-word window (two words on either side) of the term's position in the string. I also want to leave stopwords out of the output vectors.

My code is below. I can get the stopwords out of my dictionary's keys but not the values.

from itertools import count

words = ["This", "is", "an", "example", "sentence"]
stopwords = ["it", "the", "was", "of"]
context_size = 2

# Map each non-stopword to the slice running from two words before it to two words after it.
stripes = {word: words[max(i - context_size, 0):j] for word, i, j in zip(words, count(0), count(context_size + 1)) if word.lower() not in stopwords}
print(stripes)

The output is:

{'example': ['is', 'an', 'example', 'sentence'], 'sentence': ['an', 'example', 'sentence']}

CodePudding user response:

You can remove stopwords from words before processing:

words = ["This", "was", "an", "example", "sentence" ]
stopwords = ["it", "the", "was", "of"]

# Compare in lowercase so capitalised stopwords are dropped as well.
words = [w for w in words if w.lower() not in stopwords]

Output:

>>> words
['This', 'an', 'example', 'sentence']

Now you can run your function on the filtered list, which no longer contains stopwords.
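To make that concrete, here is a minimal sketch that filters first and then applies the comprehension from the question. The helper name build_stripes is made up for illustration, and the missing `+` in `count(context_size + 1)` is assumed to be what the question intended.

```python
from itertools import count


def build_stripes(words, stopwords, context_size=2):
    # build_stripes is a hypothetical helper name, not from the question.
    stop = {s.lower() for s in stopwords}
    # Drop stopwords before building any windows.
    filtered = [w for w in words if w.lower() not in stop]
    # Same comprehension as in the question, applied to the filtered list.
    return {
        word: filtered[max(i - context_size, 0):j]
        for word, i, j in zip(filtered, count(0), count(context_size + 1))
    }


words = ["This", "was", "an", "example", "sentence"]
stopwords = ["it", "the", "was", "of"]
stripes = build_stripes(words, stopwords)
print(stripes)
```

Because the filtering happens first, the window only counts the words that remain, so two neighbours in a context may have been further apart in the original text. As in the question's code, each context still includes the word itself.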

CodePudding user response:

words = ["This", "is", "a", "longer", "example", "sentence"]
stopwords = set(["it", "the", "was", "of", "is", "a"])
context_size = 2

stripes = []
for index, word in enumerate(words):
    if word.lower() in stopwords:
        continue
    i = max(index - context_size, 0)
    j = min(index + context_size, len(words) - 1) + 1
    context = words[i:index] + words[index + 1:j]
    stripes.append((word, context))

print(stripes)

I would recommend using a list of tuples rather than a dict. If a word occurs more than once in words, a dict keeps only the last occurrence, because each later entry overwrites the earlier one. I would also put the stopwords in a set, especially if it's a larger list such as NLTK's stopwords, since membership tests on a set are faster than on a list.
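A small sketch of the overwriting problem, using a made-up word list with a repeated word. The neighbours helper and the defaultdict merge at the end are assumptions added for illustration; merging is only one possible way to get back to a dict if you need one.

```python
from collections import defaultdict


def neighbours(words, index, context_size=2):
    # Hypothetical helper: up to context_size words on each side of words[index].
    return words[max(index - context_size, 0):index] + words[index + 1:index + 1 + context_size]


words = ["red", "fish", "blue", "fish"]  # made-up input with a repeated word

# A dict keeps only the context of the last "fish".
as_dict = {w: neighbours(words, i) for i, w in enumerate(words)}

# A list of tuples keeps one entry per occurrence.
as_pairs = [(w, neighbours(words, i)) for i, w in enumerate(words)]

# Optional: merge all occurrences of a word into one combined context list.
merged = defaultdict(list)
for w, ctx in as_pairs:
    merged[w].extend(ctx)

print(as_dict["fish"])
print([ctx for w, ctx in as_pairs if w == "fish"])
print(merged["fish"])
```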

I also excluded the word itself from its context, but depending on how you want to use the result, you might want to include it.

This results in:

[('This', ['is', 'a']), ('longer', ['is', 'a', 'example', 'sentence']), ('example', ['a', 'longer', 'sentence']), ('sentence', ['longer', 'example'])]
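Note that the contexts above still contain stopwords such as 'is' and 'a'. If you want them gone from the values too, as the question asks, one possible variation is to filter each window after slicing it. This is only a sketch: the function name stripes_without_stopwords is made up, and it assumes the window should still be measured in positions of the original text.

```python
def stripes_without_stopwords(words, stopwords, context_size=2):
    # Hypothetical helper name; lowercase the stopwords once for fast lookups.
    stop = {s.lower() for s in stopwords}
    stripes = []
    for index, word in enumerate(words):
        if word.lower() in stop:
            continue  # stopwords never become keys
        i = max(index - context_size, 0)
        j = index + context_size + 1
        # Window over the original positions, excluding the word itself.
        window = words[i:index] + words[index + 1:j]
        # Drop stopwords from the context values as well.
        stripes.append((word, [w for w in window if w.lower() not in stop]))
    return stripes


words = ["This", "is", "a", "longer", "example", "sentence"]
stopwords = {"it", "the", "was", "of", "is", "a"}
result = stripes_without_stopwords(words, stopwords)
print(result)
```

With this variation a context can end up shorter than four words, or even empty, when most of a word's neighbours are stopwords.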