Home > Software design >  How to plot a word frequency ranking in ggplot - only have one variable?
How to plot a word frequency ranking in ggplot - only have one variable?

Time:09-22

I'm trying to use ggplot to plot my word frequency rankings from Quanteda. Works passing the 'frequency' variable to plot but I want a nicer graph.

ggplot needs two variables for aes. I've tried seq_along as suggested on a somewhat similar thread but the graph draws nothing.

ggplot(word_list, aes(x = seq_along(freqs), y = freqs, group = 1))   
        geom_line()  
        labs(title = "Rank Frequency Plot", x = "Rank", y = "Frequency")

Any input appreciated!

symptoms_corpus <- corpus(X$TEXT, docnames = X$id )

summary(symptoms_corpus)

# print text of any element of the corpus by index
cat(as.character(symptoms_corpus[6500]))

# Create Document Feature Matrix
Symptoms_DFM <- dfm(symptoms_corpus)
Symptoms_DFM

# sum columns for word counts
freqs <- colSums(Symptoms_DFM)
# get vocabulary vector
words <- colnames(Symptoms_DFM)
# combine words and their frequencies in a data frame
word_list <- data.frame(words, freqs)
# re-order the wordlist by decreasing frequency
word_indexes <- order(word_list[, "freqs"], decreasing = TRUE)

word_list <- word_list[word_indexes, ]
# show the most frequent words
head(word_list, 25)

#plot
ggplot(word_list, aes(x = seq_along(freqs), y = freqs, group = 1))   
        geom_line()  
        labs(title = "Rank Frequency Plot", x = "Rank", y = "Frequency")

By nicer graph I mean that using the base 'plot' function below works and illustrates the ranking distribution, but this takes only one variable. ggplot needs two and that's where the issue arises for me. The ggplot code will draw the graph but no data displayed.

plot(word_list$freqs , type = "l", lwd=2, main = "Rank frequency Plot", xlab="Rank", ylab ="Frequency")

Example dataset below:

first_column <- c("the","patient", "arm", "rash", "tingling", "was", "in", "not")
second_column <- c("4116407", "3599537", "2582586", "1323883", "1220894", "1012042", "925339", "822150")

word_list2 <- data.frame(first_column, second_column)
colnames(word_list2) <- c=("word", "freqs")

CodePudding user response:

Here's a tidier, and reproducible demonstration of your plot, using the built-in corpus.

library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

symptoms_corpus <- data_corpus_inaugural
Symptoms_DFM <- tokens(symptoms_corpus) %>%
  dfm()

Better to use quanteda.textstats::textstat_frequency() here:

# create frequency table
library("quanteda.textstats")
word_list <- textstat_frequency(Symptoms_DFM)
head(word_list, 25)
##    feature frequency rank docfreq group
## 1      the     10183    1      59   all
## 2       of      7180    2      59   all
## 3        ,      7173    3      59   all
## 4      and      5406    4      59   all
## 5        .      5155    5      59   all
## 6       to      4591    6      59   all
## 7       in      2827    7      59   all
## 8        a      2292    8      58   all
## 9      our      2224    9      58   all
## 10      we      1827   10      58   all
## 11    that      1813   11      59   all
## 12      be      1502   12      59   all
## 13      is      1491   13      58   all
## 14      it      1398   14      59   all
## 15     for      1230   15      59   all
## 16      by      1091   16      59   all
## 17    have      1031   17      59   all
## 18   which      1007   18      57   all
## 19     not       980   19      58   all
## 20    with       970   20      58   all
## 21      as       966   21      58   all
## 22    will       944   22      57   all
## 23    this       874   23      59   all
## 24       i       871   24      58   all
## 25     all       836   25      59   all

Then plot it:

# Zipf's law plot
library("ggplot2")
ggplot(word_list, aes(x = seq_len(nrow(word_list)), y = frequency, group = 1))  
  geom_line()  
  coord_trans(y = "log10", x = "log10")  
  labs(title = "Rank Frequency Plot", x = "Rank", y = "Frequency")

CodePudding user response:

I am unsure of what you mean by a "nicer graph". Could you specify? It is not possible to reproduce the example by the code that you provided, because we do not have your dataset.

Perhaps you could simply add row numbers as the x-values as below. This produces an ordered graph

library(ggplot2)

word_list <- data.frame(freq = c(10, 12, 18, 19))

ggplot(word_list, aes(x = 1:nrow(word_list), y = freq, group = 1))   
  geom_line()  
  labs(title = "Rank Frequency Plot", x = "Rank", y = "Frequency")

CodePudding user response:

I needed to logarithmically scale, dataset is huge so wasn't appearing. Example above, @TrineCosmusNobel, pointed this out. Thanks. Updated code below:

ggplot(word_list, aes(x = 1:nrow(word_list), y = freqs, group = 1))   
        geom_line()  
        coord_trans(y ='log10', x='log10')  
        labs(title = "Rank Frequency Plot", x = "Rank", y = "Frequency")
  • Related