Extract a sentence from text based on multiple criteria using R

Time:10-14

I am trying to extract a sentence from text with multiple rows and multiple sentences per row based on the following criteria:

  1. Contains the word "bonus" or "incentive" (case insensitive)
  2. Sentences can be defined by punctuation, new lines or control characters (\n, \r, etc)

Test data:

text <-  c("This is a sentence. $5k SIGN-ON BONUS offered. This is another sentence. Salary is $15.00 per hours. Another",
         "This is a sentence. Retention bonus of $5,000 offered! This is another sentence. Salary is $15.00 per hours? Another", 
         "This is a sentence. $5k incentive offered! This is another sentence. Salary is $15.00 per hours. Another", 
         "This is a sentence\n \n$5000 sign-on Bonus offered\n \nThis is another sentence\n \nSalary is $15.00 per hours\n \nAnother", 
         "This is a sentence\n\nRetention bonus of $5000 offered\n\nThis is another sentence\n\nSalary is $15.00 per hours\n\nAnother",
         "This is a sentence\n \n$5k incentive offered\n \nThis is another sentence\n Salary is $15.00 per hours\nAnother",
         
         "This is a sentence. 
          $5k signing bonus offered! 
          This is another sentence. 
          Salary is $15.00 per hours? Another", 
         
         "This is a sentence. 
          
          This is another sentence. 
          
          $5k incentive offered! 
          Salary is $15.00 per hours? Another")

My attempt using str_extract from the stringr package doesn't quite get me what I want:

stringr::str_extract(text, "[[:print:]]*(?i)bonus|(?i)incentive[[:print:]]*[[:cntrl:]]|[[:punct:]]")

[1] "This is a sentence. $5k SIGN-ON BONUS" "This is a sentence. Retention bonus"  
[3] "."                                     "$5000 sign-on Bonus"                  
[5] "Retention bonus"                       "incentive offered\n"                  
[7] "."                                     "."

Desired output would be:

[1] "$5k SIGN-ON BONUS offered"                "Retention bonus of $5,000 offered"  
[3] "$5k incentive offered"                    "$5000 sign-on Bonus offered"                  
[5] "Retention bonus of $5000 offered"         "$5k incentive offered"                  
[7] "$5k signing bonus offered"                "$5k incentive offered"

Any suggestions would be much appreciated!

CodePudding user response:

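We could split 'text' at one or more whitespace characters (\\s+) that follow a . or at a newline character, unlist the resulting list, and use grep to keep only the sentences that match the keyword pattern. Note that this splits only after a period, so sentences ending in ! or ? stay attached to the following sentence, and trailing punctuation is kept.

grep("bonus|incentive", unlist(strsplit(text,
   "(?<=\\.)\\s+|\n", perl = TRUE)), value = TRUE, ignore.case = TRUE)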

-output

[1] "$5k SIGN-ON BONUS offered."                                   "Retention bonus of $5,000 offered! This is another sentence."
[3] "$5k incentive offered! This is another sentence."             "$5000 sign-on Bonus offered"                                 
[5] "Retention bonus of $5000 offered"                             "$5k incentive offered"                                       
[7] "$5k signing bonus offered! "                                  "$5k incentive offered! "     
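
The output above still differs from the desired output in two ways: sentences ending in ! or ? are not split off, and trailing punctuation is kept. The following is a minimal base R sketch of one way to close that gap, assuming a sentence ends at ., ! or ? followed by whitespace, or at any line break. The function name extract_sentence, its keywords argument, and the demo vector are made up for illustration; it returns the first matching sentence per element, or NA when none matches.

```r
# Hypothetical helper: first sentence per element containing any keyword.
extract_sentence <- function(x, keywords = c("bonus", "incentive")) {
  # Split after ., ! or ? when followed by whitespace, or at line breaks.
  # "$15.00" is not split because its period is not followed by whitespace.
  parts <- strsplit(x, "(?<=[.!?])\\s+|[\r\n]+", perl = TRUE)
  pattern <- paste(keywords, collapse = "|")
  vapply(parts, function(p) {
    p <- trimws(p)                                   # drop padding around each piece
    hit <- grep(pattern, p, ignore.case = TRUE, value = TRUE)
    if (length(hit) == 0) return(NA_character_)      # no keyword in this element
    sub("[.!?]+$", "", hit[1])                       # strip trailing punctuation
  }, character(1))
}

# Small made-up demo in the spirit of the test data
demo <- c("This is a sentence. $5k SIGN-ON BONUS offered. Salary is $15.00 per hours.",
          "This is a sentence\n \nRetention bonus of $5000 offered\n \nAnother",
          "No keyword here.")
result <- extract_sentence(demo)
print(result)
```

Calling extract_sentence(text) on the vector from the question should give one sentence per element. If several sentences in one element contain a keyword, only the first is kept, which is a design choice of this sketch.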

CodePudding user response:

Maybe something like this:

library(tidyverse)

tibble(text) %>% 
  # one row per piece, splitting on ".", newline, or "!"
  separate_rows(text, sep = '\\.|\n|\\!') %>% 
  # trim and collapse whitespace
  mutate(text = str_squish(text)) %>%
  # keep pieces containing either keyword, ignoring case
  filter(str_detect(text, regex("bonus|incentive", ignore_case = TRUE)))

-output

  text                             
  <chr>                            
1 $5k SIGN-ON BONUS offered        
2 Retention bonus of $5,000 offered
3 $5k incentive offered            
4 $5000 sign-on Bonus offered      
5 Retention bonus of $5000 offered 
6 $5k incentive offered            
7 $5k signing bonus offered        
8 $5k incentive offered  
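
For comparison with the str_extract attempt in the question: in that pattern the | operators have the lowest precedence, so the regex is read as three separate alternatives ("anything followed by bonus", "incentive followed by anything and a control character", or "a single punctuation character"), which is why some results are just ".". Below is a hedged base R sketch of a single-regex version that groups the keywords and matches everything around them up to the nearest ., !, ? or line break. The function name extract_by_regex and the demo vector are hypothetical. One known limitation, stated as an assumption: the matched sentence itself must not contain a period, so a bonus written as "$5.50" would be cut short.

```r
# Hypothetical helper: match the keyword plus surrounding non-terminator characters.
extract_by_regex <- function(x) {
  # (?i) makes the match case insensitive; the character class stops at
  # sentence-ending punctuation and at line breaks.
  pattern <- "(?i)[^.!?\\r\\n]*(?:bonus|incentive)[^.!?\\r\\n]*"
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  hit <- !is.na(m) & m != -1            # elements where a match was found
  out[hit] <- trimws(regmatches(x, m))  # regmatches() returns only the matches
  out
}

# Small made-up demo
demo2 <- c("This is a sentence. $5k incentive offered! This is another sentence.",
           "First line\nRetention bonus of $5,000 offered\nLast line",
           "Nothing relevant.")
result2 <- extract_by_regex(demo2)
print(result2)
```

The same pattern should also work with stringr::str_extract(text, pattern) followed by trimming, since stringr accepts the (?i) inline flag.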