I want to create a document-term matrix in R using two dataframes.
For example, the first dataframe contains the text.
df1
category  text
person1   "hello word I like turtles"
person2   "re: turtles! I think turtles are stellar!"
person3   "sunflowers are nice."
The second dataframe has a column with all of the terms of interest.
df2
col1  term
x     turtles
y     hello
w     sunflowers
f     I
The resulting matrix would show each person's use of each word in df2$term.
results
category  turtles  hello  sunflowers  I
person1   1        1      0           1
person2   2        0      0           1
person3   0        0      1           0
help!
CodePudding user response:
Here's a hack using regular expressions, apply and merge:
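# Example data from the question
category = c('person1','person2','person3')
text = c("hello word I like turtles", "re: turtles! I think turtles are stellar!", "sunflowers are nice.")
# Keep text as character (not factor) so sapply() can name its results by text
df1 = data.frame(category=category, text=text, stringsAsFactors=FALSE)
df2 = data.frame(term=c('turtles','hello','sunflowers','I'), stringsAsFactors=FALSE)
# For one pattern, count its occurrences in every row of df1$text
f = function(pattern){
  patterncount = function(x){ # Counts occurrences of pattern in a single string
    if (grepl(x, pattern=pattern)){
      length(gregexpr(x, pattern=pattern)[[1]])
    } else{
      0
    }
  }
  sapply(df1$text, FUN = patterncount)
}
# One column per term; row names are the original texts
df3 = data.frame(sapply(df2$term, FUN=f))
df3$text = row.names(df3)
# Join the counts back to the categories via the text column
result = merge(df1, df3, by='text')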
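The counting step in the hack above can probably be shortened. Here is a minimal base R sketch, assuming the terms should be matched as literal substrings (hence fixed = TRUE); count_term is a made-up helper name, not something from the question or from a package. It skips the merge by binding the counts straight onto the category column, which relies on the rows staying in the same order as df1.

```r
# Data from the question
df1 <- data.frame(
  category = c("person1", "person2", "person3"),
  text = c("hello word I like turtles",
           "re: turtles! I think turtles are stellar!",
           "sunflowers are nice."),
  stringsAsFactors = FALSE
)
df2 <- data.frame(term = c("turtles", "hello", "sunflowers", "I"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: number of literal occurrences of `term` in each
# element of `text`. regmatches() returns character(0) when nothing
# matches, so lengths() gives 0 without a separate grepl() check.
count_term <- function(term, text) {
  lengths(regmatches(text, gregexpr(term, text, fixed = TRUE)))
}

# One column per term; sapply() names the columns after the terms
counts <- sapply(df2$term, count_term, text = df1$text)

# Rows of `counts` are in the same order as df1, so no merge is needed
result <- cbind(df1["category"], counts)
print(result)
```

Unlike the merge version, this keeps the rows in the order of df1 and leaves out the text column.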
CodePudding user response:
Using str_count from stringr:
library(stringr)
cbind(df1[1], sapply(df2$term, function(x) str_count(df1$text, x)))
#  category turtles hello sunflowers I
#1  person1       1     1          0 1
#2  person2       2     0          0 1
#3  person3       0     0          1 0
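One caveat: str_count treats each term as a regular expression and counts it anywhere in the string, so a short term like I is also counted inside longer words. If whole-word counts are wanted, wrapping each term in \\b word boundaries is one option. The sketch below uses made-up example sentences (not the data from the question) to show the difference, and assumes the terms contain no regex metacharacters.

```r
library(stringr)

# Hypothetical example data, chosen so that terms also occur inside longer words
df1 <- data.frame(
  category = c("person1", "person2"),
  text = c("I like IKEA", "turtles and turtlenecks"),
  stringsAsFactors = FALSE
)
terms <- c("I", "turtle")

# Plain patterns: counted anywhere, including inside other words
loose <- sapply(terms, function(x) str_count(df1$text, x))

# Word-boundary patterns: counted only when the term stands alone as a word
strict <- sapply(terms, function(x) str_count(df1$text, paste0("\\b", x, "\\b")))

print(cbind(df1["category"], loose))
print(cbind(df1["category"], strict))
```

With these sentences the plain version counts I twice for person1 (once inside IKEA) and turtle twice for person2, while the word-boundary version counts I once and turtle not at all.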