finding distribution of words in R

Time:10-18

I want to find the distribution of the number of titles with 1 word, 2 words, 3 words, ... in my dataset "jnl.dt" in R.

one_word_title = 0
two_word_title = 0
three_word_title = 0
for (i in 1:x){
  if (str_count(jnl.dt[i]$`Full Title`, '\\w+')==1){one_word_title <- one_word_title + 1}
  else if (str_count(jnl.dt[i]$`Full Title`, '\\w+')==2){two_word_title <- two_word_title + 1}
  else if (str_count(jnl.dt[i]$`Full Title`, '\\w+')==3){three_word_title <- three_word_title + 1}
}
one_word_title
two_word_title 
three_word_title 

Is there a way to find the distribution of titles by number of words without hardcoding each word count?

CodePudding user response:

Here's a somewhat tentative proposal, given the absence of reproducible data:

Let's assume you have this kind of data and titles:

df <- data.frame(titles = c("The Great Gatsby", "That's the Story of my Life", "Love Story", "Alice in Wonderland", "Harry Potter"))

To get the "distribution" of the number of words in the titles, you can do this:

library(dplyr)
library(stringr)
df %>%
  mutate(N_w = str_count(titles, "\\S+")) %>%
  group_by(N_w) %>% 
  summarise(Dist_N_w = n())
# A tibble: 3 x 2
    N_w Dist_N_w
* <int>    <int>
1     2        2
2     3        2
3     6        1

Note that using \\w versus \\S makes a difference: because the apostrophe is not in the \\w character class (letters, digits, and the underscore), That's is counted as 2 words with \\w+. If you use \\S+ instead, where \\S is a negated character class matching anything that is not whitespace (whitespace here includes spaces, tabs, and newline and carriage-return characters), That's is counted as 1 word.
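To see the difference concretely, here is a small sketch in base R. It swaps stringr's str_count for regmatches/gregexpr so that it needs no packages; the helper name count_matches is made up for illustration and is not part of any library.

```r
# Hypothetical helper (not from stringr): count non-overlapping
# regex matches in each element of a character vector.
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

title <- "That's the Story of my Life"

# \\w+ : the apostrophe is not a word character, so "That's" yields "That" and "s"
n_w <- count_matches(title, "\\w+")

# \\S+ : only whitespace separates tokens, so "That's" stays one token
n_S <- count_matches(title, "\\S+")

print(c(w = n_w, S = n_S))
```

For this title the \\w+ count comes out one higher than the \\S+ count, which is the extra "s" split off after the apostrophe.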

CodePudding user response:

Instead of counting each title length separately, you can tabulate all of the word counts at once:

table(stringr::str_count(jnl.dt$`Full Title`, '\\w+'))
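As a rough illustration of what table() returns, here is the same idea applied to the sample titles from the first answer, using only base R. This is a sketch under the assumption that words are separated by whitespace; it splits on whitespace with strsplit rather than calling str_count, so its counts follow the \\S+ convention rather than \\w+.

```r
# Sample titles borrowed from the first answer (not the asker's real data)
titles <- c("The Great Gatsby", "That's the Story of my Life", "Love Story",
            "Alice in Wonderland", "Harry Potter")

# Split each title on runs of whitespace and count the pieces
words_per_title <- lengths(strsplit(trimws(titles), "\\s+", perl = TRUE))

# Tabulate how many titles have each word count
dist <- table(words_per_title)
print(dist)
```

The names of the resulting table are the word counts and the values are the number of titles with that many words, so nothing about the possible counts has to be hardcoded.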

CodePudding user response:

We may use unnest_tokens from the tidytext package:

library(tidytext)
library(dplyr)
df %>%
    mutate(rn = row_number()) %>% 
    unnest_tokens(word, titles) %>% 
    count(rn) %>%
    count(n)