Count unique words in a string using dplyr (R)

Let's say I have a string as follows:

string <- "the home home on the range the friend"

All I want to do is determine which words in the string appear at least 2 times.

The pseudocode here is:

Count how many times each word appears
Return a list of the words that appear at least two times in the string

The final result should be a list containing "the" and "home", in that order.

I am hoping to do this using the tidyverse, ideally with stringr or dplyr. I also tried tidytext but have been struggling with it.

CodePudding user response：

We can split the string on whitespace, tabulate the words with table(), and subset by frequency:

# split on runs of whitespace, then tabulate the words
out <- table(strsplit(string, "\\s+")[[1]])
# keep the words that occur at least twice
out[out >= 2]

home  the 
   2    3
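
Note that table() orders its result alphabetically, so "home" comes before "the" here. A minimal sketch, assuming the frequency ordering asked for in the question is wanted, sorts the counts in decreasing order before subsetting (the names words, counts and result are only illustrative):

```r
string <- "the home home on the range the friend"

# split on runs of whitespace and tabulate the words
words <- strsplit(string, "\\s+")[[1]]
counts <- sort(table(words), decreasing = TRUE)

# keep the words seen at least twice, most frequent first
result <- names(counts[counts >= 2])
result
```

For this string, result should hold "the" followed by "home".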

CodePudding user response：

library(tidyverse)
data.frame(string) %>%
  separate_rows(string) %>%
  count(string, sort = TRUE) %>%
  filter(n >= 2)

Result

# A tibble: 2 × 2
  string     n
  <chr>  <int>
1 the        3
2 home       2
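
If only the words are needed rather than the whole table, one possible variant uses stringr for the split and dplyr for the counting, as the question asked. This is a sketch, not the only way to do it; the names word_counts and frequent_words are made up for the example:

```r
library(dplyr)
library(stringr)

string <- "the home home on the range the friend"

# one row per word, split on runs of whitespace
word_counts <- tibble(word = str_split(string, "\\s+")[[1]]) %>%
  count(word, sort = TRUE) %>%   # most frequent first
  filter(n >= 2)                 # at least two occurrences

# extract just the words as a character vector
frequent_words <- pull(word_counts, word)
frequent_words
```

Here pull() drops the counts and returns the word column, which keeps the order produced by count(sort = TRUE).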

CodePudding user response：

Here's an approach using quanteda that prints "the" before "home" as requested in the original post.

library(quanteda)
aString <- "the home home on the range the friend"
aDfm <- dfm(tokens(aString))
# extract the features where the count > 1
aDfm@Dimnames$features[aDfm@x > 1]

...and the output:

> aDfm@Dimnames$features[aDfm@x > 1]
[1] "the"  "home"
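
Indexing the @x slot directly works here because there is a single document, so every stored value lines up with one feature. A hedged alternative, assuming the exported accessors featnames() and colSums() are preferred over internal slots, is sketched below (featureCounts and frequentFeatures are illustrative names). As far as I know, the features of a dfm are kept in order of first occurrence rather than by frequency, so "the" comes first here because it is the first word in the string.

```r
library(quanteda)

aString <- "the home home on the range the friend"
aDfm <- dfm(tokens(aString))

# total count of each feature across all documents
featureCounts <- colSums(aDfm)

# names of the features that occur more than once
frequentFeatures <- featnames(aDfm)[featureCounts > 1]
frequentFeatures
```

Because the counts are summed per feature, this form should also behave sensibly when the dfm holds more than one document.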

CodePudding user response：

Here is another option using tidytext and the tidyverse: first split the string into one word per row (unnest_tokens), then count each word and sort by frequency. Finally, keep only the words that occur at least twice and use tibble::deframe to return a named vector.

library(tidytext)
library(tidyverse)

tibble(string) %>%
  unnest_tokens(word, string) %>%
  count(word, sort = TRUE) %>%
  filter(n >= 2) %>%
  deframe()

Output

 the home 
   3    2

Or, if you want to keep the result as a data frame, just omit the final deframe() step.