Document Term Matrix with two dataframes in R

Time:10-05

I want to create a document-term matrix in R using two dataframes.

For example, the first dataframe contains the text.

df1

category     text
person1      "hello word I like turtles"
person2      "re: turtles! I think turtles are stellar!"
person3      "sunflowers are nice."

The second dataframe has a column with all of the terms of interest.

df2

col1    term
x       turtles
y       hello
w       sunflowers
f       I

The resulting matrix would show each person's use of each term in df2$term.

results

category    turtles     hello     sunflowers     I         
person1       1           1            0         1
person2       2           0            0         1
person3       0           0            1         0

help!

CodePudding user response:

Here's a hack using regular expressions, sapply and merge:

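category = c('person1', 'person2', 'person3')
text = c("hello word I like turtles", "re: turtles! I think turtles are stellar!", "sunflowers are nice.")
# stringsAsFactors = FALSE keeps the columns as character (R < 4.0 defaults to factors),
# so sapply() below names its output by text and the merge can match on it
df1 = data.frame(category = category, text = text, stringsAsFactors = FALSE)
df2 = data.frame(term = c('turtles', 'hello', 'sunflowers', 'I'), stringsAsFactors = FALSE)

# For one pattern, returns its number of matches in each element of df1$text
f = function(pattern){
  patterncount = function(x){ # Counts occurrences of pattern in a string
    if (grepl(x, pattern=pattern)){
      length(gregexpr(x, pattern=pattern)[[1]])
    } else{
      0
    }
  }
  sapply(df1$text, FUN = patterncount)
}

# One column per term; the row names are the texts
df3 = data.frame(sapply(df2$term, FUN=f))
df3$text = row.names(df3)

# Join the counts back to the categories by text
result = merge(df1, df3, by='text')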
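
For comparison, here is a minimal sketch of the same count in base R that skips the merge step. It assumes each term should be matched as a literal substring (fixed = TRUE) rather than as a regular expression, and count_term is a hypothetical helper name used only for this illustration.

```r
df1 <- data.frame(
  category = c("person1", "person2", "person3"),
  text = c("hello word I like turtles",
           "re: turtles! I think turtles are stellar!",
           "sunflowers are nice."),
  stringsAsFactors = FALSE
)
df2 <- data.frame(term = c("turtles", "hello", "sunflowers", "I"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: counts literal occurrences of one term in each text.
count_term <- function(term, texts) {
  matches <- gregexpr(term, texts, fixed = TRUE)
  # gregexpr() reports -1 when nothing matches, so count only positive positions.
  vapply(matches, function(m) sum(m > 0), integer(1))
}

# One column per term, one row per text.
counts <- vapply(df2$term, count_term, integer(nrow(df1)), texts = df1$text)
result <- cbind(df1["category"], counts)
```

Because the counts stay in the same row order as df1, cbind can attach them directly and there is no need to join on the text column.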

CodePudding user response:

Using str_count from stringr -

library(stringr)
cbind(df1[1], sapply(df2$term, function(x) str_count(df1$text, x)))

#  category turtles hello sunflowers I
#1  person1       1     1          0 1
#2  person2       2     0          0 1
#3  person3       0     0          1 0
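
One caveat worth checking: str_count treats each term as a regular expression and counts substring matches, so a short term such as "I" is also counted inside longer words. A small sketch of the difference, assuming the terms contain no regex metacharacters; the second text is a hypothetical example and is not in df1.

```r
library(stringr)

# The second text is hypothetical, added only to show the difference.
texts <- c("hello word I like turtles", "I like IBM turtles")
term <- "I"

# Plain pattern: also counts the "I" inside "IBM".
substring_counts <- str_count(texts, term)

# \\b restricts each match to a whole word.
word_counts <- str_count(texts, paste0("\\b", term, "\\b"))
```

If whole-word counts are what you want, the same paste0("\\b", x, "\\b") pattern can be passed to str_count inside the sapply call above.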