Count unique words in a string using dplyr (R)

Time:06-13

Let's say I have a string as follows:

string <- "the home home on the range the friend"

All I want to do is determine which words in the string appear at least 2 times.

The pseudocode here is:

  • Count how many times each word appears
  • Return a list of the words that appear at least two times in the string

The final result should be a list containing both "the" and "home", in that order.

I am hoping to do this using the tidyverse, ideally with stringr or dplyr. I was attempting to use tidytext as well but have been struggling.

CodePudding user response:

We can split the string on whitespace, tabulate the words with table, and subset based on frequency.

out <- table(strsplit(string, "\\s+")[[1]])
out[out >= 2]

home  the 
   2    3 
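
Note that table() orders its result alphabetically by word, which is why "home" comes before "the" here. If the order asked for in the question matters, one option is to sort the counts in decreasing order first. A minimal base R sketch (the names words, counts and repeated are just illustrative):

```r
string <- "the home home on the range the friend"

# Split on runs of whitespace and tabulate the words
words <- strsplit(string, "\\s+")[[1]]
counts <- sort(table(words), decreasing = TRUE)

# Keep the names of the words that appear at least twice
repeated <- names(counts[counts >= 2])
repeated
```

This should return "the" followed by "home" as a plain character vector.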

CodePudding user response:

library(tidyverse)
data.frame(string) %>%
  separate_rows(string) %>%
  count(string, sort = TRUE) %>%
  filter(n >= 2)

Result

# A tibble: 2 × 2
  string     n
  <chr>  <int>
1 the        3
2 home       2
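
Since the question mentions stringr, here is a sketch of the same idea that does the splitting with stringr::str_split instead of separate_rows. It assumes the words are separated only by whitespace, and the column name word is my own choice rather than anything from the question:

```r
library(dplyr)
library(stringr)

string <- "the home home on the range the friend"

repeated <- tibble(word = str_split(string, "\\s+")[[1]]) %>%
  count(word, sort = TRUE) %>%   # one row per word, most frequent first
  filter(n >= 2) %>%             # keep words seen at least twice
  pull(word)                     # return just the words as a character vector
repeated
```

Using pull() at the end gives a character vector; drop that step to keep the counts in a tibble.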

CodePudding user response:

Here's an approach using quanteda that prints "the" before "home" as requested in the original post.

library(quanteda)
aString <- "the home home on the range the friend"
aDfm <- dfm(tokens(aString))
# extract the features where the count > 1
aDfm@Dimnames$features[aDfm@x > 1]

...and the output:

> aDfm@Dimnames$features[aDfm@x > 1]
[1] "the"  "home"
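
The snippet above reads the dfm's internal slots directly. A sketch of an alternative that stays with quanteda's exported functions is topfeatures(), which returns feature counts sorted from most to least frequent. This assumes a reasonably recent quanteda version, and the names freqs and repeated are illustrative:

```r
library(quanteda)

aString <- "the home home on the range the friend"
aDfm <- dfm(tokens(aString))

# Named vector of feature counts, most frequent first
freqs <- topfeatures(aDfm)

# Keep the features that occur more than once
repeated <- names(freqs[freqs > 1])
repeated
```

Because the result is sorted by count, the order here reflects frequency rather than first appearance in the string.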

CodePudding user response:

Here is another option using tidytext and the tidyverse. We first split the string into one word per row (unnest_tokens), then count each word and sort by frequency. We then keep only the words that appear at least twice and use tibble::deframe to return a named vector.

library(tidytext)
library(tidyverse)

tibble(string) %>%
  unnest_tokens(word, string) %>%
  count(word, sort = TRUE) %>%
  filter(n >= 2) %>%
  deframe()

Output

 the home 
   3    2 

Or, if you want to keep the result as a data frame, just drop the final deframe() step.
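
One practical difference from splitting on whitespace: with its default settings, unnest_tokens lowercases the text and drops punctuation, so "The" and "the" are counted together. A small sketch with a made-up input (the string messy is hypothetical, not from the question):

```r
library(tidytext)
library(dplyr)
library(tibble)

# Hypothetical input with mixed case and punctuation
messy <- "The home, home on the range. The friend"

result <- tibble(string = messy) %>%
  unnest_tokens(word, string) %>%   # defaults: lowercase, punctuation stripped
  count(word, sort = TRUE) %>%      # most frequent words first
  filter(n >= 2) %>%                # keep words seen at least twice
  deframe()                         # named vector: names are words, values are counts
result
```

If case or punctuation should be preserved, a plain whitespace split may be the better fit.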
