I want to find the distribution of the number of titles with 1 word, 2 words, 3 words, and so on in my dataset "jnl.dt" in R.
library(stringr)

one_word_title = 0
two_word_title = 0
three_word_title = 0
for (i in 1:nrow(jnl.dt)){
  if (str_count(jnl.dt[i]$`Full Title`, '\\w+') == 1) {
    one_word_title <- one_word_title + 1
  } else if (str_count(jnl.dt[i]$`Full Title`, '\\w+') == 2) {
    two_word_title <- two_word_title + 1
  } else if (str_count(jnl.dt[i]$`Full Title`, '\\w+') == 3) {
    three_word_title <- three_word_title + 1
  }
}
one_word_title
two_word_title
three_word_title
Is there a way to find the distribution of the number of titles with different numbers of words without hardcoding the number of words in the title?
CodePudding user response:
Here's a proposal, somewhat tentative given the absence of reproducible data.
Let's assume you have this kind of data and titles:
df <- data.frame(titles = c("The Great Gatsby", "That's the Story of my Life", "Love Story", "Alice in Wonderland", "Harry Potter"))
To get the "distribution" of the number of words in the titles, you can do this:
library(dplyr)
library(stringr)
df %>%
  mutate(N_w = str_count(titles, "\\S+")) %>%
  group_by(N_w) %>%
  summarise(Dist_N_w = n())
# A tibble: 3 x 2
    N_w Dist_N_w
* <int>    <int>
1     2        2
2     3        2
3     6        1
Note that using \\w+ and \\S+ makes a difference: since the apostrophe is not contained in the \\w character class (letters, digits, and the underscore), That's will be counted as 2 words. If you use \\S+ instead, which matches runs of any character that is not whitespace (the match is only broken by actual spaces, tabs, newlines, return characters, etc.), the count for That's will be 1.
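A quick way to see the difference, as a minimal sketch assuming stringr is loaded:

library(stringr)

str_count("That's", "\\w+")  # 2: "That" and "s" are separate runs of word characters
str_count("That's", "\\S+")  # 1: the whole token is a single run of non-whitespace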
CodePudding user response:
Instead of handling each word count separately, you can get the whole distribution in one step with table:
table(stringr::str_count(jnl.dt$`Full Title`, '\\w+'))
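For illustration (using the example df from the first answer as a stand-in for jnl.dt, which isn't shown), this should print something like:

table(stringr::str_count(df$titles, '\\w+'))
#>
#> 2 3 7
#> 2 2 1

Note the 7 for "That's the Story of my Life": with '\\w+' the apostrophe splits That's into two words, as discussed in the first answer.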
CodePudding user response:
We may use unnest_tokens from tidytext:
library(tidytext)
library(dplyr)
df %>%
  mutate(rn = row_number()) %>%   # keep track of which title each word came from
  unnest_tokens(word, titles) %>% # one row per word
  count(rn) %>%                   # number of words per title
  count(n)                        # number of titles per word count
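Applied to the example df from the first answer, this should return something like the tibble below (assuming the default word tokenizer keeps That's as a single token, so that title counts as 6 words; n is the number of words and nn the number of titles with that many words):

# A tibble: 3 x 2
      n    nn
  <int> <int>
1     2     2
2     3     2
3     6     1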