Word frequency over time: How to count the word frequency by date?

I have a data frame that looks like this:

date	text
201901	Thank you for helping me
201902	You are amazing
201902	For helping with this

My aim is to count how often each word occurs for each date, so that the result eventually looks like this:

date	thank	you	for	helping	me	are	amazing	with	this
201901	1	1	1	1	1	0	0	0	0
201902	0	1	1	1	0	1	1	1	1

The actual data set has the same structure but contains millions of lines of text, so I am wondering how to automate this process in R without typing out all the text by hand.

CodePudding user response：

Using R and tidyverse:

df <- data.frame(date = c(201901, 201902, 201902),
                 text = c("Thank you for helping me", "You are amazing", "For helping with this"))

library(tidyverse)

If you want your data as a table of counts:

df %>%
  separate_rows(text, sep = " ") %>%   # one row per word
  mutate(text = tolower(text)) %>%     # treat "For" and "for" as the same word
  table()                              # cross-tabulate date against word

Output:

text
date     amazing are for helping me thank this with you
  201901       0   0   1       1  1     1    0    0   1
  201902       1   1   1       1  0     0    1    1   1

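If you want your output as a tibble:

df %>%
  separate_rows(text, sep = " ") %>%   # one row per word
  mutate(text = tolower(text)) %>%     # treat "For" and "for" as the same word
  table() %>%                          # cross-tabulate date against word
  as_tibble() %>%                      # long format with columns date, text, n
  pivot_wider(names_from = text, values_from = n)   # one column per word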

Output:

# A tibble: 2 x 10
  date   amazing   are `for` helping    me thank  this  with   you
  <chr>    <int> <int> <int>   <int> <int> <int> <int> <int> <int>
1 201901       0     0     1       1     1     1     0     0     1
2 201902       1     1     1       1     0     0     1     1     1

Edit: converted everything to lowercase to match your desired output, and added the output.

Edit 2: showed that you can also get the result as a tibble to work with it further.
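
Splitting on a single space, as above, keeps punctuation attached to words (for example "me." and "me" would be counted separately). As a minimal sketch of the same idea in base R, the following assumes that a word is a run of ASCII letters or apostrophes; the names `tokens`, `long`, `counts` and `wide` are illustrative and not part of the original answer.

```r
# Example data copied from the question.
df <- data.frame(
  date = c(201901, 201902, 201902),
  text = c("Thank you for helping me", "You are amazing", "For helping with this"),
  stringsAsFactors = FALSE
)

# Assumption: a "word" is a run of ASCII letters or apostrophes;
# anything else (spaces, digits, punctuation) separates words.
tokens <- strsplit(tolower(df$text), "[^a-z']+")

# One row per (date, word) pair.
long <- data.frame(
  date = rep(df$date, lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)
long <- long[nzchar(long$word), ]  # drop empty strings left by leading separators

# Cross-tabulate dates against words; absent combinations are counted as 0.
counts <- table(date = long$date, word = long$word)

# Plain data frame with dates as row names and one column per word.
wide <- as.data.frame.matrix(counts)
wide
```

Because `table()` reports every date and word combination, words that never occur for a date appear as 0, which matches the layout asked for in the question.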

CodePudding user response：

library(tidyverse)
library(tidytext)

df <- tibble(
  date = c("201901", "201902", "201902"),
  text = c("Thank you for helping me", 
           "You are amazing", 
           "For helping with this")
)

# A tibble: 3 x 2
  date   text                    
  <chr>  <chr>                   
1 201901 Thank you for helping me
2 201902 You are amazing         
3 201902 For helping with this

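df %>%
  unnest_tokens("words", text) %>%   # one lowercase word per row, punctuation stripped
  group_by(date, words) %>%
  summarise(count = n()) %>%         # occurrences of each word per date
  ungroup() %>%
  spread(words, count)               # one column per word; missing pairs become NA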

# A tibble: 2 x 10
  date   amazing   are for helping    me thank  this  with   you
  <chr>    <int> <int> <int>   <int> <int> <int> <int> <int> <int>
1 201901      NA    NA     1       1     1     1    NA    NA     1
2 201902       1     1     1       1    NA    NA     1     1     1
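
The result above contains NA where a word does not occur for a date, whereas the question asks for 0. One possible variation, sketched here under the assumption that the dplyr, tidyr and tidytext packages are installed, uses `count()` and `pivot_wider()` with `values_fill` so that missing combinations become zeros. The column name `word` and the object name `wide` are illustrative choices.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Example data copied from the answer above.
df <- data.frame(
  date = c("201901", "201902", "201902"),
  text = c("Thank you for helping me",
           "You are amazing",
           "For helping with this"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  unnest_tokens(word, text) %>%   # one lowercase word per row
  count(date, word) %>%           # occurrences of each word per date, in column n
  pivot_wider(
    names_from  = word,
    values_from = n,
    values_fill = 0L              # word/date pairs that never occur become 0 instead of NA
  )

wide
```

`count(date, word)` is shorthand for the `group_by()`, `summarise(n())` and `ungroup()` steps, and `pivot_wider()` is the current replacement for `spread()`.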