I'd like to implement a function that produces a matrix counting how often each word occurs in each sentence.
An example input vector:
c("I love bananas", "I hate bananas", "I love apples and I hate bananas")
The output should be:
I love hate bananas apples and
1 1 0 1 0 0
1 0 1 1 0 0
2 1 1 1 1 1
My attempt for the function looks like this:
wordextract <- function(sentences) {
words <- unique(unlist(strsplit(sentences," ")))
bag <- map(words, ~ str_count(sentences, regex(.x, ignore_case = TRUE))) %>%
do.call(cbind, .)
bag %>% rename_with(~ words)
}
The problem occurs when I add the column names from the words vector inside the function. Even so, I suspect this can be solved much more efficiently.
CodePudding user response:
If you're just looking for a function that will do this, you could use:
corpus <- c("I love bananas", "I hate bananas", "I love apples and I hate bananas")
library(quanteda)
dfm(tokens(corpus))
#> Document-feature matrix of: 3 documents, 6 features (33.33% sparse) and 0 docvars.
#> features
#> docs i love bananas hate apples and
#> text1 1 1 1 0 0 0
#> text2 1 0 1 1 0 0
#> text3 2 1 1 1 1 1
Alternatively, if you want to get your function working, you could put as.data.frame() in the pipeline before the rename_with() function. do.call(cbind, .) returns a matrix, while rename_with() only works on data frames, so the conversion should do it:
library(dplyr)
library(stringr)
library(purrr)
wordextract <- function(sentences) {
words <- unique(unlist(strsplit(sentences," ")))
bag <- map(words, ~ str_count(sentences, regex(.x, ignore_case = TRUE))) %>%
do.call(cbind, .)
bag %>% as.data.frame %>% rename_with(~words)
}
wordextract(corpus)
#> I love bananas hate apples and
#> 1 1 1 1 0 0 0
#> 2 1 0 1 1 0 0
#> 3 2 1 1 1 1 1
Created on 2022-11-06 by the reprex package (v2.0.1)
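If you would rather avoid extra packages, here is a minimal base R sketch of the same idea. The name word_counts is made up for illustration, and it assumes words are separated by single spaces and that matching should be case-sensitive. It counts whole tokens rather than regex matches, which sidesteps a quirk of the str_count() approach: regex(.x) matches substrings, so a word such as "a" would also be counted inside "bananas".

```r
# Hypothetical helper (not from any package): counts whole tokens per sentence.
# Assumes single-space separators and case-sensitive matching.
word_counts <- function(sentences) {
  # One character vector of tokens per sentence
  tokens <- strsplit(sentences, " ", fixed = TRUE)
  # Vocabulary in order of first appearance
  words <- unique(unlist(tokens))
  # For each sentence, count how often each vocabulary word occurs;
  # tabulate() with nbins keeps zero counts for absent words
  counts <- do.call(rbind, lapply(tokens, function(tok) {
    tabulate(match(tok, words), nbins = length(words))
  }))
  colnames(counts) <- words
  counts
}

corpus <- c("I love bananas", "I hate bananas", "I love apples and I hate bananas")
word_counts(corpus)
```

The result is an integer matrix with one row per sentence and one column per distinct word; wrap it in as.data.frame() if you need a data frame instead.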