While using WordCloud for Python, why is the frequency of the letter "S" considered in the word cloud?

Time:05-20

I'm getting to know the WordCloud package for Python and I'm testing it with the Moby Dick Text from NLTK. A snippet of this is as follows:

Snippet of my example string

As you can see from the highlights in the image, all of the possessive apostrophes have been split out as "'S", and WordCloud seems to be including this in the frequency count as "S":

Frequency distribution of words

Of course this causes an issue, because "S" is counted with a high frequency and all the other words' frequencies are skewed in the cloud:

Example of my skewed cloud

In a tutorial that I'm following for the same Moby Dick string, the word cloud doesn't seem to be counting the "S". Am I missing an attribute somewhere, or do I have to manually remove the "'S" tokens from my string?

Below is a summary of my code:

import nltk
import matplotlib.pyplot as plt
from wordcloud import WordCloud

example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
word_list = ["".join(word) for word in example_corpus]
novel_as_string = " ".join(word_list)

wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")

plt.show()
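To see where the "S" entries come from, note that `.words()` yields the apostrophe and the trailing S as separate tokens, so joining everything with spaces produces stray one-character "words". A minimal illustration (the token list here is a hand-made sample, not the real corpus output):

```python
# Sample tokens mimicking how nltk.corpus.gutenberg.words() splits a
# possessive: the apostrophe and the "S" become separate tokens.
tokens = ['RICHARDSON', "'", 'S', 'DICTIONARY']

joined = " ".join(tokens)
print(joined)
# RICHARDSON ' S DICTIONARY
```

Once the text looks like this, WordCloud's tokenizer sees a standalone "S" and counts it as a word.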

CodePudding user response:

In this kind of application, it is usual to filter the word list with stopwords first, since you don't want common function words such as "a", "an", "the", "it", ... to dominate your result.

I changed the code a little bit; hope it helps. You can check the contents of the stopword list yourself.

import nltk
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
# word_list = ["".join(word) for word in example_corpus] # this statement seems to change nothing
# using stopwords to filter words; build the set once for speed, and compare
# case-insensitively, since the NLTK stopword list is all lowercase
stop_set = set(stopwords.words("english"))
word_list = [word for word in example_corpus if word.lower() not in stop_set]
novel_as_string = " ".join(word_list)

wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")

plt.show()

output: see wordcloud Imgur

CodePudding user response:

It looks like your input is part of the problem. If you do the following,

corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
words = [word for word in corpus]
print(words[215:230])

You get

['RICHARDSON', "'", 'S', 'DICTIONARY', 'KETOS', ',', 'GREEK', '.', 'CETUS', ',', 'LATIN', '.', 'WHOEL', ',', 'ANGLO']

You can do a few things to try and overcome this. You could simply filter out strings of length one,

words = [word for word in corpus if len(word) > 1]
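Applied to the token slice printed above, that filter drops the stray "'" and "S" pieces along with the punctuation (the token list is copied from the output shown):

```python
# Token slice as printed from the Moby Dick corpus
slice_ = ['RICHARDSON', "'", 'S', 'DICTIONARY', 'KETOS', ',', 'GREEK', '.',
          'CETUS', ',', 'LATIN', '.', 'WHOEL', ',', 'ANGLO']

# Keep only tokens longer than one character
filtered = [word for word in slice_ if len(word) > 1]
print(filtered)
# ['RICHARDSON', 'DICTIONARY', 'KETOS', 'GREEK', 'CETUS', 'LATIN', 'WHOEL', 'ANGLO']
```

Note this also discards legitimate one-letter words such as "a" and "I", which is usually acceptable for a word cloud since those are stopwords anyway.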

You could try a different file provided by NLTK, or you could try reading the input raw and decoding it properly.
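Alternatively, instead of discarding the possessive pieces, you could re-attach each "'" token and the token that follows it to the preceding word before joining. A rough sketch with a hypothetical helper (not part of NLTK), demonstrated on a hand-made token list:

```python
def merge_possessives(tokens):
    """Re-attach "'" tokens and their following token to the previous word,
    so ['RICHARDSON', "'", 'S'] becomes ["RICHARDSON'S"]."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "'" and merged and i + 1 < len(tokens):
            # Glue the apostrophe and the next token onto the last word.
            merged[-1] = merged[-1] + "'" + tokens[i + 1]
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_possessives(['RICHARDSON', "'", 'S', 'DICTIONARY']))
# ["RICHARDSON'S", 'DICTIONARY']
```

This keeps the possessive attached to its noun, so "RICHARDSON'S" is counted as one word instead of leaking a standalone "S" into the frequencies.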
