Quanteda: calculating token frequency in a dfm, including a customized list of phrases

Hello guys, I'm wishing you all a very happy new year!

I have been wondering whether textstat_frequency from the powerful quanteda library in R can also count a list of phrases (multi-word "words"). For instance, I have the following data set:

library(quanteda)
library(quanteda.textstats)

df_sample<-c("Word Record",
             "be able to count by word",
             "But also include some phrases such as",
             "World Record Super Bass Mr. President Mr. President")

When I calculate the textstat_frequency of the df_sample I get something like this:

> tokens<-corpus(df_sample) %>% tokens(remove_punct = TRUE)
> dfm<-dfm(tokens)
> 
> quanteda.textstats::textstat_frequency(dfm)
     feature frequency rank docfreq group
1       word         2    1       2   all
2     record         2    1       2   all
3         mr         2    1       1   all
4  president         2    1       1   all
5         be         1    5       1   all
6       able         1    5       1   all
7         to         1    5       1   all
8      count         1    5       1   all
9         by         1    5       1   all
10       but         1    5       1   all
11      also         1    5       1   all
12   include         1    5       1   all
13      some         1    5       1   all
14   phrases         1    5       1   all
15      such         1    5       1   all
16        as         1    5       1   all
17     world         1    5       1   all
18     super         1    5       1   all
19      bass         1    5       1   all
>

This is correct, but I also want to change my code in order to count, and print in the output, the phrases "Mr. President", "World Record" and "Super Bass":

key_lookups<-c("Mr. President", "World Record", "Super Bass" )

How can I use quanteda functions so that my output contains, along with the previous counts, the frequency of these phrases? For example:

"Mr. President" 2 "World Record" 2 "Super Bass" 1

Thank you so much for your help. I am a very active and attentive user, and I will upvote and select the best answer as soon as answers come in. Also, if you can, please reference or link the documentation your answer is derived from. Thanks a lot and happy new year!

CodePudding user response：

In the quanteda library you can take advantage of the function tokens_compound:

library(quanteda)
library(quanteda.textstats)

df_sample<-c("World Record",
             "be able to count by word",
             "But also include some phrases such as",
             "World Record Super Bass Mr. President Mr. President")

toks <- tokens(df_sample,remove_punct = TRUE)

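Now let's compound the key_lookups in the toks object. The period in "Mr. President" is left out of the lookup because remove_punct = TRUE has already removed it from the tokens: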

key_lookups<-c("Mr President", "World Record", "Super Bass" )
toks_comp <- tokens_compound(toks, pattern = phrase(key_lookups))

Take a look at the output:

> toks_comp %>% dfm() %>% textstat_frequency()
        feature frequency rank docfreq group
1  world_record         2    1       2   all
2  mr_president         2    1       1   all
3            be         1    3       1   all
4          able         1    3       1   all
5            to         1    3       1   all
6         count         1    3       1   all
7            by         1    3       1   all
8          word         1    3       1   all
9           but         1    3       1   all
10         also         1    3       1   all
11      include         1    3       1   all
12         some         1    3       1   all
13      phrases         1    3       1   all
14         such         1    3       1   all
15           as         1    3       1   all
16   super_bass         1    3       1   all
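
Note that compounding replaces the individual words: world, record, mr, president, super and bass no longer appear in the table above. If the single-word counts are wanted as well, one possible sketch (not part of the answer above) is to build a unigram dfm and a phrase-only dfm from the same tokens and join them column-wise. It assumes that cbind() on two dfms with the same documents and non-overlapping features works as described in the quanteda documentation. The names dfm_words, dfm_phrases, phrase_features, dfm_all and freq_all are made up for illustration.

```r
library(quanteda)
library(quanteda.textstats)

df_sample <- c("World Record",
               "be able to count by word",
               "But also include some phrases such as",
               "World Record Super Bass Mr. President Mr. President")
key_lookups <- c("Mr President", "World Record", "Super Bass")

toks <- tokens(df_sample, remove_punct = TRUE)

# Unigram counts, as in the question
dfm_words <- dfm(toks)

# Compounded tokens; dfm() lowercases by default, so the phrases
# end up as features such as "world_record"
toks_comp <- tokens_compound(toks, pattern = phrase(key_lookups))
phrase_features <- gsub(" ", "_", tolower(key_lookups), fixed = TRUE)

# Keep only the compounded phrases from the second dfm
dfm_phrases <- dfm_select(dfm(toks_comp), pattern = phrase_features)

# Both dfms describe the same four documents, so join them side by side
dfm_all <- cbind(dfm_words, dfm_phrases)

freq_all <- textstat_frequency(dfm_all)

# Show a few single words next to the phrases
freq_all[freq_all$feature %in% c("world", "record", phrase_features), ]
```

Since the columns are joined per document, this should keep the original document names instead of adding new ones.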

CodePudding user response：

First, a warning about your example code: do not create objects that have the same name as functions (like tokens and dfm). This will (eventually) lead to errors that are difficult to debug.

There are probably a few ways of doing this. I created a "normal" tokens object and a bigram (ngrams) tokens object, turned both into dfms, and kept only the phrases you wanted from the ngrams dfm. I then combined the two dfms, after which you can use textstat_frequency as normal.

Note: you can't combine tokens objects like you can combine dfm objects.

library(quanteda)
library(quanteda.textstats)

df_sample<-c("Word Record",
             "be able to count by word",
             "But also include some phrases such as",
             "World Record Super Bass Mr. President Mr. President")



my_tokens <- corpus(df_sample) %>% tokens(remove_punct = TRUE)
my_dfm <- dfm(my_tokens)

# No periods in the lookups: punctuation was removed when tokenizing
key_lookups<-c("Mr President", "World Record", "Super Bass" )


my_tokens_ngram <- tokens_ngrams(my_tokens, n = 2, concatenator = " ")

my_dfm_ngrams <- dfm(my_tokens_ngram)

# Only keep the lookups
my_dfm_ngrams <- dfm_keep(my_dfm_ngrams, key_lookups)

# Combine both dfms
my_dfms <- rbind(my_dfm, my_dfm_ngrams)

# if necessary uncomment next part
# my_dfms <- dfm_compress(my_dfms)

Outcome:

head(textstat_frequency(my_dfms), 5)
       feature frequency rank docfreq group
1         word         2    1       2   all
2       record         2    1       2   all
3           mr         2    1       1   all
4    president         2    1       1   all
5 mr president         2    1       1   all

tail(textstat_frequency(my_dfms), 5)
        feature frequency rank docfreq group
18        world         1    6       1   all
19        super         1    6       1   all
20         bass         1    6       1   all
21 world record         1    6       1   all
22   super bass         1    6       1   all

Note that using rbind on dfms creates new document names like "text1.1". If you want these merged back into the original documents, call dfm_compress(my_dfms) first and then call textstat_frequency.
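
If only the phrase counts are of interest, the frequency table can be filtered afterwards, because textstat_frequency returns a data frame. Below is a minimal sketch that repeats the steps above on the same sample data. The names freq and phrase_freq are introduced here for illustration only.

```r
library(quanteda)
library(quanteda.textstats)

df_sample <- c("Word Record",
               "be able to count by word",
               "But also include some phrases such as",
               "World Record Super Bass Mr. President Mr. President")
key_lookups <- c("Mr President", "World Record", "Super Bass")

my_tokens <- tokens(df_sample, remove_punct = TRUE)
my_dfm <- dfm(my_tokens)

# Bigrams joined with a space so they look like the lookup phrases
my_tokens_ngram <- tokens_ngrams(my_tokens, n = 2, concatenator = " ")
my_dfm_ngrams <- dfm_keep(dfm(my_tokens_ngram), key_lookups)

my_dfms <- rbind(my_dfm, my_dfm_ngrams)
freq <- textstat_frequency(my_dfms)

# dfm() lowercases features, so compare against lowercased phrases
phrase_freq <- freq[freq$feature %in% tolower(key_lookups),
                    c("feature", "frequency")]
phrase_freq
```

With the sample data as written, the first document is "Word Record" rather than "World Record", which is why the output above reports world record with a frequency of 1 and not the 2 the question expects.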