Classify text using 'from text' and 'to text' columns in a data frame in R

Here is my toy data. (In my original data I have hundreds of rules, i.e. such from-to combinations.)

rule_set <- tibble::tribble(
      ~if_you_see,  ~write,
    "honda civic",   "car",
  "toyota camery",   "car",
      "honda CRV",   "car",
            "boy", "human",
           "girl", "human")

The following data frame holds the text to be classified with the rule set above. If one of the car names from the rule set appears in the text, the class is 'car'; if boy or girl appears, the class is 'human'.

to_be_classified <- tibble::tribble(
                                        ~text,
       "I have a honda civic. It pretty good.",
  "The toyota camery is also good. I like it",
                   "That boy has a honda CRV.",
           "You see that girl in blue dress?")

Here is the desired output:

desired_output <- tibble::tribble(
                                        ~text,       ~class,
       "I have a honda civic. It pretty good.",        "car",
  "The toyota camery is also good. I like it",        "car",
                  "That boy has a honda CRV.", c("human, car"),
           "You see that girl in blue dress?",      "human")

I started thinking along the following lines, but could not move forward.

library(tidyverse)
to_be_classified %>% 
      mutate(class = if_else(str_detect(text, pattern = rule_set$if_you_see), if_you_see$write))

Please advise, preferably with code involving dplyr and stringr functions.

CodePudding user response：

I'm not familiar with stringr functions, but I can suggest this:

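library(dplyr)

to_be_classified %>% 
  # sapply() builds a logical matrix: one row per text, one column per rule
  mutate(class = sapply(rule_set$if_you_see, grepl, x = text) %>% 
           # for each row, paste the classes of the rules that matched
           apply(1, function(x) paste(rule_set$write[x], collapse = ", ")))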

The same approach with stringr's str_detect instead of grepl:

library(dplyr)
library(stringr)

to_be_classified %>% 
  mutate(class = sapply(rule_set$if_you_see, str_detect, string = text) %>% 
           apply(1, function(x) paste(rule_set$write[x], collapse = ", ")))
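
One thing to watch with the matrix approach above: a text that mentions two phrases of the same class (say, two car models) would come out as "car, car". Here is a minimal sketch that keeps each class once and treats the phrases as literal text rather than regular expressions. `classify_one` is a hypothetical helper name, and the toy data is repeated only so the snippet runs on its own:

```r
library(dplyr)
library(stringr)

rule_set <- tibble::tribble(
  ~if_you_see,     ~write,
  "honda civic",   "car",
  "toyota camery", "car",
  "honda CRV",     "car",
  "boy",           "human",
  "girl",          "human"
)

to_be_classified <- tibble::tribble(
  ~text,
  "I have a honda civic. It pretty good.",
  "The toyota camery is also good. I like it",
  "That boy has a honda CRV.",
  "You see that girl in blue dress?"
)

# Hypothetical helper: distinct classes whose phrase occurs in a single text.
classify_one <- function(txt, rules) {
  # One logical per rule; fixed() matches the phrase literally, not as a regex.
  hit <- str_detect(txt, fixed(rules$if_you_see))
  paste(unique(rules$write[hit]), collapse = ", ")
}

classified <- to_be_classified %>%
  mutate(class = vapply(text, classify_one, character(1),
                        rules = rule_set, USE.NAMES = FALSE))

classified
```

For the toy data this should give car, car, "car, human" and human. The classes follow the order of the rules, not the order in which the phrases appear in the text, and short phrases such as "boy" would also match inside longer words.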

CodePudding user response：

Here's a 'tidy' solution:

Use setNames to build a named vector that maps each phrase in if_you_see to its class in write:

replacements <- setNames(rule_set$write, rule_set$if_you_see)

Use replacements in a pipe:

library(tidyverse)

to_be_classified %>%
  mutate(
    # replace values in `if_you_see` with values in `write`:
    class = str_replace_all(text, replacements),
    # extract and list occurrences of "car" and "human":
    class = lapply(str_extract_all(class, "car|human"), toString)) %>%
  # unnest listed items:
  unnest(where(is.list))

# A tibble: 4 × 2
  text                                      class     
  <chr>                                     <chr>     
1 I have a honda civic. It pretty good.     car       
2 The toyota camery is also good. I like it car       
3 That boy has a honda CRV.                 human, car
4 You see that girl in blue dress?          human
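
With hundreds of rules, hard-coding "car|human" gets unwieldy, and a text that already contains the word "car" or "human" would be picked up as well. A possible variation, assuming the phrases contain no regex metacharacters, extracts the phrases themselves and then looks up their classes. `lookup` and `phrase_regex` are illustrative names that are not part of the answer above, and the toy data is repeated so the snippet is self-contained:

```r
library(dplyr)
library(stringr)

rule_set <- tibble::tribble(
  ~if_you_see,     ~write,
  "honda civic",   "car",
  "toyota camery", "car",
  "honda CRV",     "car",
  "boy",           "human",
  "girl",          "human"
)

to_be_classified <- tibble::tribble(
  ~text,
  "I have a honda civic. It pretty good.",
  "The toyota camery is also good. I like it",
  "That boy has a honda CRV.",
  "You see that girl in blue dress?"
)

# Named lookup vector: phrase -> class.
lookup <- setNames(rule_set$write, rule_set$if_you_see)

# One alternation over all phrases (assumption: no regex metacharacters in them).
phrase_regex <- str_c(rule_set$if_you_see, collapse = "|")

classified <- to_be_classified %>%
  mutate(
    # All phrases found in each text, in order of appearance.
    hits = str_extract_all(text, phrase_regex),
    # Map phrases to classes, drop repeats, join into one string.
    class = vapply(hits, function(h) toString(unique(lookup[h])), character(1))
  ) %>%
  select(-hits)

classified
```

Because the matches are taken in the order they occur in the text, the third row should come out as "human, car", as in the desired output. A text with no matching phrase gets an empty string.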

CodePudding user response：

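library(tidyverse)

# Collapse the phrases of each class into a single regex alternation,
# e.g. "honda civic|toyota camery|honda CRV" for "car".
patterns_df <- rule_set %>% 
  group_by(write) %>% 
  summarise(
    patterns = paste(if_you_see, collapse = "|")
  )

to_be_classified %>% 
  rowwise() %>% 
  mutate(
    # Test this row's text against every class pattern and keep the matching classes.
    class = list(str_detect(text, patterns_df$patterns) %>% 
      patterns_df$write[.]),
    # Join the matching classes into one comma-separated string.
    class = paste(unlist(class), collapse = ", ")
  ) %>% 
  ungroup()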

#> # A tibble: 4 × 2
#>   text                                      class     
#>   <chr>                                     <chr>     
#> 1 I have a honda civic. It pretty good.     car       
#> 2 The toyota camery is also good. I like it car       
#> 3 That boy has a honda CRV.                 car, human
#> 4 You see that girl in blue dress?          human