I wrote some code to count how often each word appears in a data frame:
Items <- c('decid*','head', 'heads', 'decid*')
df1<-data.frame(Items)
words<- c('head', 'heads', 'decided', 'decides', 'top')
df_main<-data.frame(words)
item <- vector()
count <- vector()
for (i in 1:length(unique(Items))){
item[i] <- Items[i]
count[i]<- sum(df_main$words == item[i])}
word_freq <- data.frame(cbind(item, count))
word_freq
However, the results are like this:
| | item | count |
|---|---|---|
| 1 | decid* | 0 |
| 2 | head | 1 |
| 3 | heads | 1 |
As you can see, it does not count "decid*" correctly. The result I expect is:
| | item | count |
|---|---|---|
| 1 | decid* | 2 |
| 2 | head | 1 |
| 3 | heads | 1 |
I think I need to change the format of the item word (decid*), but I could not figure out how. Any help is much appreciated!
CodePudding user response:
I think you want to use `decid*` as a regex pattern. `==` looks for an exact match; you can use `grepl` to look for a particular pattern instead. I have used `sapply` as an alternative to the `for` loop.
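result <- stack(sapply(unique(df1$Items), function(x) {
  # Items containing a literal '*' are used as regex patterns;
  # all other items must match a word exactly.
  if (grepl('*', x, fixed = TRUE)) sum(grepl(x, df_main$words))
  else sum(x == df_main$words)
}))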
result
# values ind
#1 2 decid*
#2 1 head
#3 1 heads
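One caveat with passing `decid*` straight to `grepl`: as a regular expression it means "deci" followed by zero or more "d", so it would also match a word such as "decimal". If the `*` is meant as a wildcard, a minimal sketch using `glob2rx()` from the utils package could look like the following. It assumes every item is either a plain word or a wildcard pattern; the names `patterns` and `count` are only illustrative.

```r
# Same data as in the question.
Items <- c('decid*', 'head', 'heads', 'decid*')
words <- c('head', 'heads', 'decided', 'decides', 'top')

# glob2rx() turns a wildcard such as 'decid*' into the anchored regex
# '^decid', and a plain word such as 'head' into '^head$'.
patterns <- unique(Items)
count <- vapply(patterns,
                function(p) sum(grepl(glob2rx(p), words)),
                integer(1))

word_freq <- data.frame(item = patterns, count = count, row.names = NULL)
word_freq
```

Because plain words become fully anchored patterns, the same call handles both cases and no `if`/`else` on the presence of `*` is needed.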
CodePudding user response:
Perhaps as an alternative approach altogether: instead of creating a new dataframe `word_freq`, why not create a new column in `df_main` (if that's your "main" dataframe) that gives the number of matches against your (apparently key) `Items`:
library(stringr)
df_main$Items_count <- apply(df_main, 1, function(x) sum(str_count(x, Items)))
Result:
df_main
words Items_count
1 head 1
2 heads 2
3 decided 2
4 decides 2
5 top 0
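Note that the counts above include the duplicated `decid*` entry in `Items` twice, and that "heads" is counted for both `head` and `heads` because the match is on substrings. If one count per distinct, whole-word item is wanted instead, a base R sketch (an assumption about the intent, not part of the answer above) could be:

```r
Items <- c('decid*', 'head', 'heads', 'decid*')
df_main <- data.frame(words = c('head', 'heads', 'decided', 'decides', 'top'),
                      stringsAsFactors = FALSE)

# De-duplicate the items and convert each wildcard to an anchored regex.
regexes <- vapply(unique(Items), glob2rx, character(1))

# For each word, count how many distinct items match it.
df_main$Items_count <- vapply(
  df_main$words,
  function(w) sum(vapply(regexes, grepl, logical(1), x = w)),
  integer(1),
  USE.NAMES = FALSE
)
df_main
```

With this variant each of "head", "heads", "decided" and "decides" gets a count of 1, and "top" gets 0.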