For a large-scale text analysis problem, I have one data frame of words assigned to categories, and a second data frame with a column of strings plus an (empty) count column for each category. I want to take each string, check which of the defined words appear in it, and count them under the appropriate category.
As a simplified example, given the two data frames below, I want to count how many animals of each type appear in the text cell.
df_texts <- tibble(
text=c("the ape and the fox", "the tortoise and the hare", "the owl and the the
grasshopper"),
mammals=NA,
reptiles=NA,
birds=NA,
insects=NA
)
df_animals <- tibble(animals=c("ape", "fox", "tortoise", "hare", "owl", "grasshopper"),
type=c("mammal", "mammal", "reptile", "mammal", "bird", "insect"))
So my desired result would be:
df_result <- tibble(
text=c("the ape and the fox", "the tortoise and the hare", "the owl and the the
grasshopper"),
mammals=c(2,1,0),
reptiles=c(0,1,0),
birds=c(0,0,1),
insects=c(0,0,1)
)
Is there a straightforward way to achieve this keyword-matching-and-counting that would be applicable to a much larger dataset?
Thanks in advance!
Answer:
Here's a way to do it in the tidyverse. First check whether each string in df_texts$text contains each animal, then sum the matches by text and type.
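library(tidyverse)
# One logical column per animal: TRUE if the animal appears in the text
cbind(df_texts[, 1], sapply(df_animals$animals, grepl, df_texts$text)) %>%
  pivot_longer(-text, names_to = "animals") %>%
  # Attach each animal's type, then sum the matches per text and type
  left_join(df_animals, by = "animals") %>%
  group_by(text, type) %>%
  summarise(sum = sum(value), .groups = "drop") %>%
  pivot_wider(id_cols = text, names_from = type, values_from = sum)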
  text                                  bird insect mammal reptile
  <chr>                                <int>  <int>  <int>   <int>
1 "the ape and the fox"                    0      0      2       0
2 "the owl and the the \n grasshopper"     1      0      0       0
3 "the tortoise and the hare"              0      0      1       1
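One limitation of the grepl version is that it only records whether a word is present, so a word that appears twice in the same text still contributes 1. A minimal illustration, using a made-up sentence that is not part of the question's data:

```r
library(stringr)

sentence <- "the fox and the other fox"  # hypothetical example text

present <- grepl("fox", sentence)        # TRUE: presence only
n_found <- str_count(sentence, "fox")    # 2: every occurrence is counted
```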
To count multiple occurrences of the same word within a text, use str_count instead of grepl:
# One count column per animal: number of times the animal appears in the text
cbind(df_texts[, 1], t(sapply(df_texts$text, str_count, df_animals$animals, USE.NAMES = FALSE))) %>%
  setNames(c("text", df_animals$animals)) %>%
  pivot_longer(-text, names_to = "animals") %>%
  left_join(df_animals, by = "animals") %>%
  group_by(text, type) %>%
  summarise(sum = sum(value), .groups = "drop") %>%
  pivot_wider(id_cols = text, names_from = type, values_from = sum)
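Both versions match substrings, so "ape" would also be counted inside a word like "grape", and both run one pattern per keyword. For a larger dataset, one possible variation is to build a single word-bounded pattern per category and call str_count once per category. This is only a sketch: it assumes the keywords contain no regex metacharacters, the names patterns and df_counts are introduced here for illustration, and the sample data repeat the question's frames (with the third text on one line).

```r
library(tidyverse)

df_texts <- tibble(
  text = c("the ape and the fox",
           "the tortoise and the hare",
           "the owl and the the grasshopper")
)
df_animals <- tibble(
  animals = c("ape", "fox", "tortoise", "hare", "owl", "grasshopper"),
  type    = c("mammal", "mammal", "reptile", "mammal", "bird", "insect")
)

# One regex per category, e.g. "\\b(ape|fox|hare)\\b" for mammals.
# Assumption: keywords are plain words without regex metacharacters.
patterns <- df_animals %>%
  group_by(type) %>%
  summarise(pattern = str_c("\\b(", str_c(animals, collapse = "|"), ")\\b"))

# str_count is vectorised over the texts, so each category needs one call.
counts <- map(
  set_names(patterns$pattern, patterns$type),
  ~ str_count(df_texts$text, .x)
)

# Put the text column next to one count column per category.
df_counts <- bind_cols(df_texts["text"], as_tibble(counts))
```

The count columns take their names from the type values, so they come out singular (mammal, reptile, bird, insect) and in alphabetical order. Texts with no matches keep a row of zeros.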