Creating a Token count by date and co-occurence term proportion by date using quanteda


Hello guys, I have a fairly massive data set that contains reviews of utility services from customers all over the UK. This is a small sample of what the data looks like:

df <- data.frame (text  = c("The investors and their supporters shall invest and do something mostly invest",
         " Shall we tell the investors to invest ?",  "Investors shall invest.",
         "Investors may sometimes invest","spend what Investor Do"),
                  date = c("10/12/2022", "10/12/2022", "10/12/2022","11/12/2022","12/12/2022"))
  1. I want to be able to count the frequency of terms/words/tokens by date.

For instance, the word invest appears 6 times in total, and for the date 10/12/2022 its count is 4. I want to use the quanteda library (since it is so powerful) to compute the counts and plot them over date.

  2. I also want to plot the association or co-occurrence of the words investor and invest over date.

For instance, in this example we have 5 reviews, and in 4 of those 5 the words invest and investor were both present; I'd like to plot that percentage over date as well, if possible. Or what options does the quanteda library have for performing this task? Would it also be possible to find, say, the words that co-occur with "invest" above a minimum proportion of 0.25?

To achieve the first point I started with the following code:

df %>% 
  corpus(text_field="text") %>% 
  dfm() %>%
  textstat_frequency(10)

which gives:

      feature frequency rank docfreq group
1      invest         6    1       5   all
2   investors         4    2       4   all
3       shall         3    3       3   all
4         the         2    4       2   all
5         and         2    4       1   all
6          do         2    4       2   all
7       their         1    7       1   all
8  supporters         1    7       1   all
9   something         1    7       1   all
10         we         1    7       1   all
Warning message:
'dfm.corpus()' is deprecated. Use 'tokens()' first. 

But how would I go about plotting the frequency of these words over the date column? I read in the documentation that one can group, but I have had no luck doing so.

And for the second question, I'm not sure which function of the quanteda library to use, but I am trying to mirror tm::findAssocs() from the tm library.

I am super attentive to your answers, guys; I will be upvoting and accepting an answer as soon as they come. Thanks a trillion for your help, it really means the world to me.

CodePudding user response:

Answer to your first question:

The dates are stored in the docvars part of your corpus, which textstat_frequency() can use via its groups argument.

dat <- data.frame (text  = c("The investors and their supporters shall invest and do something mostly invest",
                            " Shall we tell the investors to invest ?",  "Investors shall invest.",
                            "Investors may sometimes invest","spend what Investor Do"),
                  date = c("10/12/2022", "10/12/2022", "10/12/2022","11/12/2022","12/12/2022"))


library(dplyr)
library(quanteda)
library(quanteda.textstats)

dat %>% 
  corpus(text_field="text") %>% 
  tokens() %>%
  dfm() %>% 
  textstat_frequency(groups = date)

      feature frequency rank docfreq      group
1      invest         4    1       3 10/12/2022
2   investors         3    2       3 10/12/2022
3       shall         3    2       3 10/12/2022
4         the         2    4       2 10/12/2022
5         and         2    4       1 10/12/2022
6       their         1    6       1 10/12/2022
7  supporters         1    6       1 10/12/2022
8          do         1    6       1 10/12/2022
9   something         1    6       1 10/12/2022
10     mostly         1    6       1 10/12/2022
11         we         1    6       1 10/12/2022
12       tell         1    6       1 10/12/2022
13         to         1    6       1 10/12/2022
14          ?         1    6       1 10/12/2022
15          .         1    6       1 10/12/2022
16  investors         1    1       1 11/12/2022
17     invest         1    1       1 11/12/2022
18        may         1    1       1 11/12/2022
19  sometimes         1    1       1 11/12/2022
20         do         1    1       1 12/12/2022
21      spend         1    1       1 12/12/2022
22       what         1    1       1 12/12/2022
23   investor         1    1       1 12/12/2022

You now have access to the frequency per day.
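To turn those per-day counts into a plot, one option is to filter the frequency table and pass it to ggplot2. This is a minimal sketch, assuming ggplot2 is installed and that the dates are day-first so they parse with format = "%d/%m/%Y":

```r
library(dplyr)
library(ggplot2)
library(quanteda)
library(quanteda.textstats)

dat <- data.frame(
  text = c("The investors and their supporters shall invest and do something mostly invest",
           " Shall we tell the investors to invest ?", "Investors shall invest.",
           "Investors may sometimes invest", "spend what Investor Do"),
  date = c("10/12/2022", "10/12/2022", "10/12/2022", "11/12/2022", "12/12/2022")
)

freq <- dat %>%
  corpus(text_field = "text") %>%
  tokens() %>%
  dfm() %>%
  textstat_frequency(groups = date)

# Keep the terms of interest and parse the group label as a real date,
# then draw one line per term over time
freq %>%
  filter(feature %in% c("invest", "investors")) %>%
  mutate(date = as.Date(group, format = "%d/%m/%Y")) %>%
  ggplot(aes(x = date, y = frequency, colour = feature)) +
  geom_line() +
  geom_point() +
  labs(x = "Date", y = "Frequency", colour = "Term")
```

Note that dfm() lowercases by default, so "Investors" and "investors" are counted as the same feature here.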

As for question 2, I think you can use textstat_simil(). Something like the code below. It does give somewhat different answers than tm::findAssocs(), usually more features, so I'm not completely sure this is the correct answer. Maybe someone from the quanteda team can confirm or deny this.

my_dfm <- dat %>% 
  corpus(text_field="text") %>% 
  tokens() %>%
  dfm()

textstat_simil(my_dfm, 
               my_dfm[, c("investor")], 
               method = "correlation", 
               margin = "features",
               min_simil = 0.7)

textstat_simil object; method = "correlation"
           investor
the               .
investors         .
and               .
their             .
supporters        .
shall             .
invest            .
do                .
something         .
mostly            .
we                .
tell              .
to                .
?                 .
.                 .
may               .
sometimes         .
spend             1
what              1
investor          1

You can save the outcome of textstat_simil() as a data.frame or a list with as.data.frame() or as.list() if you want to.
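If you want something closer to the tm::findAssocs() style of output, with the 0.25 threshold mentioned in the question, one option is to coerce the result and filter it. A sketch, assuming as.data.frame() on the textstat_simil object returns columns feature1, feature2 and a value column named after the method (here correlation):

```r
library(dplyr)
library(quanteda)
library(quanteda.textstats)

dat <- data.frame(
  text = c("The investors and their supporters shall invest and do something mostly invest",
           " Shall we tell the investors to invest ?", "Investors shall invest.",
           "Investors may sometimes invest", "spend what Investor Do"),
  date = c("10/12/2022", "10/12/2022", "10/12/2022", "11/12/2022", "12/12/2022")
)

my_dfm <- dat %>%
  corpus(text_field = "text") %>%
  tokens() %>%
  dfm()

# Correlate every feature with the column for "invest"
sim <- textstat_simil(my_dfm,
                      my_dfm[, "invest"],
                      method = "correlation",
                      margin = "features")

# Coerce to a data.frame and keep only the features whose correlation
# with "invest" clears the chosen threshold
as.data.frame(sim) %>%
  filter(feature1 != "invest", correlation >= 0.25) %>%
  arrange(desc(correlation))
```

Dropping the min_simil argument and filtering afterwards keeps the full set of correlations available, so you can experiment with the threshold without recomputing.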
