number of matches for keywords in specified categories-CodePudding

For a large scale text analysis problem, I have a data frame containing words that fall into different categories, and a data frame containing a column with strings and (empty) counting columns for each category. I now want to take each individual string, check which of the defined words appear, and count them within the appropriate category.

As a simplified example, given the two data frames below, i want to count how many of each animal type appear in the text cell.

df_texts <- tibble(
  text=c("the ape and the fox", "the tortoise and the hare", "the owl and the the 
  grasshopper"),
  mammals=NA,
  reptiles=NA,
  birds=NA,
  insects=NA
)

df_animals <- tibble(animals=c("ape", "fox", "tortoise", "hare", "owl", "grasshopper"),
           type=c("mammal", "mammal", "reptile", "mammal", "bird", "insect"))

So my desired result would be:

df_result <- tibble(
  text=c("the ape and the fox", "the tortoise and the hare", "the owl and the the 
  grasshopper"),
  mammals=c(2,1,0),
  reptiles=c(0,1,0),
  birds=c(0,0,1),
  insects=c(0,0,1)
)

Is there a straightforward way to achieve this keyword-matching-and-counting that would be applicable to a much larger dataset?

Thanks in advance!

CodePudding user response：

Here's a way do to it in the tidyverse. First look at whether strings in df_texts$text contain animals, then count them and sum by text and type.

library(tidyverse)

cbind(df_texts[, 1], sapply(df_animals$animals, grepl, df_texts$text)) %>% 
  pivot_longer(-text, names_to = "animals") %>% 
  left_join(df_animals) %>% 
  group_by(text, type) %>% 
  summarise(sum = sum(value)) %>% 
  pivot_wider(id_cols = text, names_from = type, values_from = sum)

  text                                   bird insect mammal reptile
  <chr>                                 <int>  <int>  <int>   <int>
1 "the ape and the fox"                     0      0      2       0
2 "the owl and the the \n  grasshopper"     1      0      0       0
3 "the tortoise and the hare"               0      0      1       1

To account for the several occurrences per text:

cbind(df_texts[, 1], t(sapply(df_texts$text, str_count, df_animals$animals, USE.NAMES = F))) %>% 
  setNames(c("text", df_animals$animals)) %>% 
  pivot_longer(-text, names_to = "animals") %>% 
  left_join(df_animals) %>% 
  group_by(text, type) %>% 
  summarise(sum = sum(value)) %>% 
  pivot_wider(id_cols = text, names_from = type, values_from = sum)