I'm getting to know the WordCloud package for Python and I'm testing it with the Moby Dick text from NLTK. A snippet of this is as follows:
[image: snippet of the Moby Dick text]
As you can see from the highlights in the image, all of the possessive apostrophes have been escaped to "/'S", and WordCloud seems to be including these in the frequency count as "S":
[image: frequency distribution of words]
Of course this causes an issue, because "S" gets a high count and all the other words' frequencies are skewed in the cloud:
[image: resulting word cloud]
In a tutorial that I'm following for the same Moby Dick string, the WordCloud doesn't seem to be counting the "S". Am I missing an attribute somewhere or do I have to manually remove "/'s" from my string?
Below is a summary of my code:
import nltk
from wordcloud import WordCloud
import matplotlib.pyplot as plt

example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
word_list = ["".join(word) for word in example_corpus]
novel_as_string = " ".join(word_list)
wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
CodePudding user response:
In this kind of application, it's usual to use stopwords to filter the word list first, since you don't want common words such as "a", "an", "the", and "it" to dominate the result. I've changed the code a little bit; hope it helps. You can check the contents of stopwords.words('english') for yourself.
import nltk
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
# word_list = ["".join(word) for word in example_corpus]  # this statement doesn't seem to change anything
# use stopwords to filter the word list; build the set once so membership tests are fast
stop_words = set(stopwords.words("english"))
word_list = [word for word in example_corpus if word not in stop_words]
novel_as_string = " ".join(word_list)
wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Output: [word cloud image]
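As an aside, WordCloud can also do the filtering for you: the constructor accepts a stopwords argument, and the package ships its own built-in STOPWORDS set. A minimal sketch, assuming you want to merge NLTK's English stopword list with the package's defaults (the exact cloud you get still depends on the package version and its default settings):

import nltk
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
novel_as_string = " ".join(example_corpus)

# let WordCloud drop the stopwords itself; merge NLTK's list with the
# package's built-in STOPWORDS set
combined_stopwords = STOPWORDS | set(stopwords.words("english"))
wordcloud = WordCloud(stopwords=combined_stopwords).generate(novel_as_string)

plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()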
CodePudding user response:
It looks like your input is part of the problem. If you inspect it like so,
import nltk

corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
words = [word for word in corpus]
print(words[215:230])
You get
['RICHARDSON', "'", 'S', 'DICTIONARY', 'KETOS', ',', 'GREEK', '.', 'CETUS', ',', 'LATIN', '.', 'WHOEL', ',', 'ANGLO']
You can do a few things to try to overcome this. You could simply filter out single-character strings:
words = [word for word in corpus if len(word) > 1]
You could try a different file provided by nltk, or you could read the input raw and decode and tokenize it properly yourself.
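For instance, here is a minimal sketch of the raw-input route, assuming nltk's gutenberg.raw() and a RegexpTokenizer pattern that keeps possessive apostrophes attached to their words (the pattern is one choice among many, not the definitive tokenization):

import nltk
from nltk.tokenize import RegexpTokenizer

# read the raw text instead of the pre-tokenized word list
raw = nltk.corpus.gutenberg.raw("melville-moby_dick.txt")

# letters plus an optional apostrophe suffix, so "RICHARDSON'S" stays one
# token instead of splitting into "RICHARDSON", "'", "S"
tokenizer = RegexpTokenizer(r"[A-Za-z]+(?:'[A-Za-z]+)?")
words = tokenizer.tokenize(raw)
print(words[:15])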