Hello everyone, I'm wishing you all a very happy new year!
I have been wondering whether it is possible to compute feature frequencies with the powerful quanteda library in R while also counting a list of phrases as if they were single "words". For instance, I have the following data set:
library(quanteda)
library(quanteda.textstats)
df_sample<-c("Word Record",
"be able to count by word",
"But also include some phrases such as",
"World Record Super Bass Mr. President Mr. President")
When I run textstat_frequency on df_sample, I get something like this:
> tokens <- corpus(df_sample) %>% tokens(remove_punct = TRUE)
> dfm <- dfm(tokens)
> quanteda.textstats::textstat_frequency(dfm)
     feature frequency rank docfreq group
 1      word         2    1       2   all
 2    record         2    1       2   all
 3        mr         2    1       1   all
 4 president         2    1       1   all
 5        be         1    5       1   all
 6      able         1    5       1   all
 7        to         1    5       1   all
 8     count         1    5       1   all
 9        by         1    5       1   all
10       but         1    5       1   all
11      also         1    5       1   all
12   include         1    5       1   all
13      some         1    5       1   all
14   phrases         1    5       1   all
15      such         1    5       1   all
16        as         1    5       1   all
17     world         1    5       1   all
18     super         1    5       1   all
19      bass         1    5       1   all
That is correct, but I also want to change my code in order to count, and print in the output, the phrases "Mr. President", "World Record" and "Super Bass":
key_lookups<-c("Mr. President", "World Record", "Super Bass" )
How can I use quanteda functions so that my output contains, along with the previous counts, the frequency of these phrases? For example:
"Mr. President" 2 "World Record" 2 "Super Bass" 1
Thank you so much for your help. I am a very active and attentive user, and I will upvote and accept the best answer as soon as it comes in. Also, if you can reference or quote the documentation your answer is derived from, please link it in your answer. Thanks a lot, and happy new year!
CodePudding user response:
In the quanteda library you can take advantage of the function tokens_compound:
library(quanteda)
library(quanteda.textstats)
df_sample<-c("World Record",
"be able to count by word",
"But also include some phrases such as",
"World Record Super Bass Mr. President Mr. President")
toks <- tokens(df_sample,remove_punct = TRUE)
Now let's compound the key_lookups over the toks object. The period is dropped from "Mr. President" because punctuation was removed during tokenization:
key_lookups<-c("Mr President", "World Record", "Super Bass" )
toks_comp <- tokens_compound(toks, pattern = phrase(key_lookups))
Take a look at the output:
> toks_comp %>% dfm() %>% textstat_frequency()
        feature frequency rank docfreq group
 1 world_record         2    1       2   all
 2 mr_president         2    1       1   all
 3           be         1    3       1   all
 4         able         1    3       1   all
 5           to         1    3       1   all
 6        count         1    3       1   all
 7           by         1    3       1   all
 8         word         1    3       1   all
 9          but         1    3       1   all
10         also         1    3       1   all
11      include         1    3       1   all
12         some         1    3       1   all
13      phrases         1    3       1   all
14         such         1    3       1   all
15           as         1    3       1   all
16   super_bass         1    3       1   all
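If you only want the counts for the phrases themselves, one possible follow-up is to filter the frequency table on the "_" that tokens_compound uses to join compounded tokens by default. This is a minimal sketch, not the only way to do it: the names key_clean and phrase_freq are made up for illustration, and it assumes none of your ordinary tokens already contain an underscore. It also strips the punctuation from the original key_lookups with gsub so you do not have to retype them by hand.

```r
library(quanteda)
library(quanteda.textstats)

df_sample <- c("World Record",
               "be able to count by word",
               "But also include some phrases such as",
               "World Record Super Bass Mr. President Mr. President")

key_lookups <- c("Mr. President", "World Record", "Super Bass")

# Drop punctuation from the lookups so they match tokens
# created with remove_punct = TRUE ("Mr. President" -> "Mr President")
key_clean <- gsub("[[:punct:]]", "", key_lookups)

toks <- tokens(df_sample, remove_punct = TRUE)
toks_comp <- tokens_compound(toks, pattern = phrase(key_clean))

freq <- textstat_frequency(dfm(toks_comp))

# Assumption: only compounded phrases contain "_" (the default concatenator)
phrase_freq <- freq[grepl("_", freq$feature, fixed = TRUE),
                    c("feature", "frequency")]
phrase_freq
```

The full freq table still holds every other token, so you can print both if you need the word counts next to the phrase counts.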
CodePudding user response:
First, a warning about your example code: do not create objects that have the same name as functions (like tokens and dfm). This will (eventually) lead to errors and is difficult to debug.
There are probably a few ways of doing this. I created a "normal" tokens object and an n-gram tokens object, turned both into dfms, and kept only the phrases you wanted from the n-gram dfm. I then combined the two dfms, after which you can use textstat_frequency as normal.
Note: you can't combine tokens objects like you can combine dfm objects.
library(quanteda)
library(quanteda.textstats)
df_sample<-c("Word Record",
"be able to count by word",
"But also include some phrases such as",
"World Record Super Bass Mr. President Mr. President")
my_tokens <- corpus(df_sample) %>% tokens(remove_punct = TRUE)
my_dfm <- dfm(my_tokens)
# No periods in the lookups: punctuation was removed during tokenization
key_lookups<-c("Mr President", "World Record", "Super Bass" )
my_tokens_ngram <- tokens_ngrams(my_tokens, n = 2, concatenator = " ")
my_dfm_ngrams <- dfm(my_tokens_ngram)
# Only keep the lookups
my_dfm_ngrams <- dfm_keep(my_dfm_ngrams, key_lookups)
# Combine both dfms
my_dfms <- rbind(my_dfm, my_dfm_ngrams)
# if necessary uncomment next part
# my_dfms <- dfm_compress(my_dfms)
Outcome:
head(textstat_frequency(my_dfms), 5)
       feature frequency rank docfreq group
1         word         2    1       2   all
2       record         2    1       2   all
3           mr         2    1       1   all
4    president         2    1       1   all
5 mr president         2    1       1   all
tail(textstat_frequency(my_dfms), 5)
        feature frequency rank docfreq group
18        world         1    6       1   all
19        super         1    6       1   all
20         bass         1    6       1   all
21 world record         1    6       1   all
22   super bass         1    6       1   all
Do note that using rbind on dfms creates new document names like "text1.1". If you want these merged back into the original documents, you can call dfm_compress(my_dfms) first and then call textstat_frequency.