Bag of Words Matrix out of sentence vector

I'd like to implement a function that produces a matrix counting how often each word occurs in each sentence.

An example input vector:

c("I love bananas", "I hate bananas", "I love apples and I hate bananas")

The output should be:

 I  love  hate  bananas  apples  and
 1     1     0        1       0    0
 1     0     1        1       0    0
 2     1     1        1       1    1

My attempt for the function looks like this:

wordextract <- function(sentences) {
  words <- unique(unlist(strsplit(sentences," ")))
  bag <- map(words, ~ str_count(sentences, regex(.x, ignore_case = TRUE))) %>%
    do.call(cbind, .) 
  bag %>% rename_with(~ words)
}

The problem occurs when I try to add the column names from the words vector inside the function. Apart from that, I think this could be solved much more efficiently.

CodePudding user response:

If you're just looking for a function that will do this, you could use dfm() from the quanteda package (note that it lowercases the tokens by default):

corpus <- c("I love bananas", "I hate bananas", "I love apples and I hate bananas")
library(quanteda)
dfm(tokens(corpus))
#> Document-feature matrix of: 3 documents, 6 features (33.33% sparse) and 0 docvars.
#>        features
#> docs    i love bananas hate apples and
#>   text1 1    1       1    0      0   0
#>   text2 1    0       1    1      0   0
#>   text3 2    1       1    1      1   1

Alternatively, if you want to get your own function working: do.call(cbind, .) returns a matrix, but rename_with() expects a data frame. Putting as.data.frame() in the pipeline before rename_with() should fix it:

library(dplyr)
library(stringr)
library(purrr)

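wordextract <- function(sentences) {
  # Vocabulary: unique space-separated tokens, in order of first appearance
  words <- unique(unlist(strsplit(sentences, " ")))
  # Count each word per sentence; \\b keeps "I" from matching inside other words
  bag <- map(words, ~ str_count(sentences, regex(paste0("\\b", .x, "\\b"), ignore_case = TRUE))) %>%
    do.call(cbind, .)
  # cbind() gives a matrix; convert it so rename_with() can set the names
  bag %>% as.data.frame() %>% rename_with(~ words)
}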
wordextract(corpus)
#>   I love bananas hate apples and
#> 1 1    1       1    0      0   0
#> 2 1    0       1    1      0   0
#> 3 2    1       1    1      1   1

Created on 2022-11-06 by the reprex package (v2.0.1)
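
On the efficiency point in the question, here is a minimal base R sketch that avoids one regex pass per word. It splits each sentence into tokens once and tabulates them against a shared vocabulary. The function name bag_of_words is made up for illustration, and the sketch assumes that whitespace-separated, lowercased tokens are what you want (no punctuation handling).

```r
# Hypothetical helper: whole-word, case-insensitive counts using base R only.
bag_of_words <- function(sentences) {
  # Lowercase, trim, and split each sentence on runs of whitespace
  tokens <- strsplit(trimws(tolower(sentences)), "\\s+")
  # Vocabulary in order of first appearance
  vocab <- unique(unlist(tokens))
  # For each sentence, count how often each vocabulary entry occurs
  counts <- lapply(tokens, function(tok) {
    tabulate(match(tok, vocab), nbins = length(vocab))
  })
  # One row per sentence, one named column per word
  matrix(unlist(counts), nrow = length(sentences), byrow = TRUE,
         dimnames = list(NULL, vocab))
}

corpus <- c("I love bananas", "I hate bananas", "I love apples and I hate bananas")
bag_of_words(corpus)
#>      i love bananas hate apples and
#> [1,] 1    1       1    0      0   0
#> [2,] 1    0       1    1      0   0
#> [3,] 2    1       1    1      1   1
```

Because the result is already a matrix with dimnames, no conversion to a data frame or renaming step is needed. Matching whole tokens also means a short word such as "I" is never counted inside a longer word.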