I have a data frame that looks like this:
date | text |
---|---|
201901 | Thank you for helping me |
201902 | You are amazing |
201902 | For helping with this |
My aim is to count how often each word occurs for each date, so that the result looks like this:
date | thank | you | for | helping | me | are | amazing | with | this |
---|---|---|---|---|---|---|---|---|---|
201901 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
201902 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
The actual data set has the same structure but contains millions of lines of text, so I am wondering how to automate this in R without typing out all the words by hand.
CodePudding user response:
Using R and tidyverse:
library(tidyverse)

# Toy data matching the question
df <- data.frame(
  date = c(201901, 201902, 201902),
  text = c("Thank you for helping me", "You are amazing", "For helping with this")
)
If you want your data as a table of counts:

df %>%
  separate_rows(text, sep = " ") %>%   # one word per row
  mutate(text = tolower(text)) %>%     # treat "For" and "for" as the same word
  table()                              # cross-tabulate date against word
Output:

        text
date     amazing are for helping me thank this with you
  201901       0   0   1       1  1     1    0    0   1
  201902       1   1   1       1  0     0    1    1   1
If you want your output as a tibble:

df %>%
  separate_rows(text, sep = " ") %>%
  mutate(text = tolower(text)) %>%
  table() %>%
  as_tibble() %>%                                    # long format: date, text, n
  pivot_wider(names_from = text, values_from = n)    # one column per word
Output:

# A tibble: 2 x 10
  date   amazing   are `for` helping    me thank  this  with   you
  <chr>    <int> <int> <int>   <int> <int> <int> <int> <int> <int>
1 201901       0     0     1       1     1     1     0     0     1
2 201902       1     1     1       1     0     0     1     1     1
Edit: lowercased everything to match your desired output, and added the output.
Edit 2: showed that you can also get the result as a tibble for further processing.
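
For comparison, roughly the same count table can be sketched in base R without loading the tidyverse. This is only a sketch on the toy data from the question: the names `tokens`, `long`, and `counts` are made up for illustration, and splitting on whitespace alone will leave punctuation attached to words in real text.

```r
# Same toy data as above
df <- data.frame(
  date = c(201901, 201902, 201902),
  text = c("Thank you for helping me", "You are amazing", "For helping with this")
)

# Lowercase, trim, and split each text on runs of whitespace;
# the result is a list with one character vector per row of df
tokens <- strsplit(trimws(tolower(df$text)), "\\s+")

# Repeat each date once per word so dates and words line up row by row
long <- data.frame(
  date = rep(df$date, lengths(tokens)),
  word = unlist(tokens)
)

# Cross-tabulate date against word
counts <- table(long$date, long$word)
counts
```

If a data frame is needed afterwards, `as.data.frame.matrix(counts)` should turn the table into one with a column per word.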
CodePudding user response:
library(tidyverse)
library(tidytext)

df <- tibble(
  date = c("201901", "201902", "201902"),
  text = c("Thank you for helping me",
           "You are amazing",
           "For helping with this")
)
# A tibble: 3 x 2
  date   text
  <chr>  <chr>
1 201901 Thank you for helping me
2 201902 You are amazing
3 201902 For helping with this
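df %>%
  unnest_tokens("words", text) %>%   # one lowercased word per row, punctuation stripped
  group_by(date, words) %>%
  summarise(count = n()) %>%         # occurrences of each word per date
  ungroup() %>%
  spread(words, count)               # one column per word; absent words show as NA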
# A tibble: 2 x 10
  date   amazing   are   for helping    me thank  this  with   you
  <chr>    <int> <int> <int>   <int> <int> <int> <int> <int> <int>
1 201901      NA    NA     1       1     1     1    NA    NA     1
2 201902       1     1     1       1    NA    NA     1     1     1
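
The desired output in the question has 0 rather than NA for words that do not occur on a date. One way to get that, sketched here with `count()` and `pivot_wider()` instead of `summarise()` and `spread()`, is to pass `values_fill = 0`. The name `wide` is only illustrative, and this assumes a tidyr version recent enough to support `values_fill` (1.1.0 or later).

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Same toy data as above
df <- tibble(
  date = c("201901", "201902", "201902"),
  text = c("Thank you for helping me",
           "You are amazing",
           "For helping with this")
)

wide <- df %>%
  unnest_tokens(words, text) %>%   # one lowercased word per row
  count(date, words) %>%           # counts per date and word, stored in column n
  pivot_wider(
    names_from  = words,
    values_from = n,
    values_fill = 0                # date/word pairs that never occur become 0, not NA
  )

wide
```

`count(date, words)` is shorthand for the `group_by()`, `summarise(n())`, `ungroup()` sequence, so the counts themselves should be unchanged; only the missing combinations differ.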