Let's say I have a string as follows:
string <- "the home home on the range the friend"
All I want to do is determine which words in the string appear at least 2 times.
The pseudocode here is:
- Count how many times each word appears
- Return the list of words that appear at least two times in the string
The final result should be a list featuring both "the" and "home", in that order.
I am hoping to do this using the tidyverse, ideally with stringr or dplyr. I was attempting to use tidytext as well but have been struggling.
CodePudding user response:
We can split the string on whitespace, get the table of word counts, and subset based on frequency:
out <- table(strsplit(string, "\\s+")[[1]])
out[out >= 2]
home the
2 3
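Note that table() orders its result alphabetically by word, which is why home is printed before the. If the order matters, one option is to sort the filtered counts in decreasing order and take the names. This is only a minimal base R sketch; the names words, counts, and repeated are illustrative and not part of the original question.

```r
string <- "the home home on the range the friend"

# Split on runs of whitespace and count each word
words <- strsplit(string, "\\s+")[[1]]
counts <- table(words)

# Keep words seen at least twice, most frequent first
repeated <- names(sort(counts[counts >= 2], decreasing = TRUE))
repeated
# [1] "the"  "home"
```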
CodePudding user response:
library(tidyverse)
data.frame(string) %>%
separate_rows(string) %>%
count(string, sort = TRUE) %>%
filter(n >= 2)
Result
# A tibble: 2 × 2
string n
<chr> <int>
1 the 3
2 home 2
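If a plain character vector is preferred over a tibble, one possible follow-up is to append dplyr's pull() to the same pipeline. This is just a sketch of that idea; repeated is an illustrative name.

```r
library(tidyverse)

string <- "the home home on the range the friend"

repeated <- data.frame(string) %>%
  separate_rows(string) %>%        # one row per word
  count(string, sort = TRUE) %>%   # word counts, most frequent first
  filter(n >= 2) %>%               # keep words seen at least twice
  pull(string)                     # extract the word column as a character vector
repeated
# [1] "the"  "home"
```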
CodePudding user response:
Here's an approach using quanteda that prints "the" before "home", as requested in the original post.
library(quanteda)
aString <- "the home home on the range the friend"
aDfm <- dfm(tokens(aString))
# extract the features where the count > 1
aDfm@Dimnames$features[aDfm@x > 1]
...and the output:
> aDfm@Dimnames$features[aDfm@x > 1]
[1] "the" "home"
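One caveat: @Dimnames and @x are internal slots of the underlying sparse matrix, and the values in @x line up with the features here only because there is a single document. A sketch that avoids the slots, using quanteda's featnames() together with colSums(), might look like this (featureCounts and repeated are illustrative names):

```r
library(quanteda)

aString <- "the home home on the range the friend"
aDfm <- dfm(tokens(aString))

# Total count of each feature across all documents
featureCounts <- colSums(aDfm)

# Features that occur more than once, in the dfm's feature order
repeated <- featnames(aDfm)[featureCounts > 1]
repeated
# [1] "the"  "home"
```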
CodePudding user response:
Here is another option using tidytext and tidyverse: we first split the string into one word per row (unnest_tokens), then count each word and sort by frequency. We then keep only the words that appear at least twice and use tibble::deframe to return a named vector.
library(tidytext)
library(tidyverse)
tibble(string) %>%
unnest_tokens(word, string) %>%
count(word, sort = TRUE) %>%
filter(n >= 2) %>%
deframe()
Output
the home
3 2
Or, if you want to keep the result as a data frame, you can just drop the last step with deframe.