I am trying to extract a sentence from text with multiple rows and multiple sentences per row based on the following criteria:
- Contains the word "bonus" or "incentive" (case insensitive)
- Sentences can be delimited by punctuation, new lines or control characters (\n, \r, etc.)
Test data:
text <- c("This is a sentence. $5k SIGN-ON BONUS offered. This is another sentence. Salary is $15.00 per hours. Another",
"This is a sentence. Retention bonus of $5,000 offered! This is another sentence. Salary is $15.00 per hours? Another",
"This is a sentence. $5k incentive offered! This is another sentence. Salary is $15.00 per hours. Another",
"This is a sentence\n \n$5000 sign-on Bonus offered\n \nThis is another sentence\n \nSalary is $15.00 per hours\n \nAnother",
"This is a sentence\n\nRetention bonus of $5000 offered\n\nThis is another sentence\n\nSalary is $15.00 per hours\n\nAnother",
"This is a sentence\n \n$5k incentive offered\n \nThis is another sentence\n Salary is $15.00 per hours\nAnother",
"This is a sentence.
$5k signing bonus offered!
This is another sentence.
Salary is $15.00 per hours? Another",
"This is a sentence.
This is another sentence.
$5k incentive offered!
Salary is $15.00 per hours? Another")
My attempt using str_extract from the stringr package doesn't quite get me what I want:
stringr::str_extract(text, "[[:print:]]*(?i)bonus|(?i)incentive[[:print:]]*[[:cntrl:]]|[[:punct:]]")
[1] "This is a sentence. $5k SIGN-ON BONUS" "This is a sentence. Retention bonus"
[3] "." "$5000 sign-on Bonus"
[5] "Retention bonus" "incentive offered\n"
[7] "." "."
Desired output would be:
[1] "$5k SIGN-ON BONUS offered" "Retention bonus of $5,000 offered"
[3] "$5k incentive offered" "$5000 sign-on Bonus offered"
[5] "Retention bonus of $5000 offered" "$5k incentive offered"
[7] "$5k signing bonus offered" "$5k incentive offered"
Any suggestions would be much appreciated!
CodePudding user response:
We could split 'text' at one or more whitespace characters (\\s+) that follow a "." or at the newline character, unlist the resulting list elements, and use grep to select those sentences that contain the keyword pattern:
grep("bonus|incentive", unlist(strsplit(text,
    "(?<=\\.)\\s+|\n", perl = TRUE)), value = TRUE, ignore.case = TRUE)
-output
[1] "$5k SIGN-ON BONUS offered." "Retention bonus of $5,000 offered! This is another sentence."
[3] "$5k incentive offered! This is another sentence." "$5000 sign-on Bonus offered"
[5] "Retention bonus of $5000 offered" "$5k incentive offered"
[7] "$5k signing bonus offered! " "$5k incentive offered! "
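In the output above, elements 2 and 3 run on past the "!" because only a period triggers a split, and the trailing punctuation is kept. A minimal sketch of one way to get closer to the desired output, assuming it is acceptable to also split after "!" and "?" and to strip trailing sentence punctuation (the shortened sample vector and the names parts and hits are only for illustration):

```r
# Small sample in the same shape as the question's test data
text <- c(
  "This is a sentence. $5k SIGN-ON BONUS offered. This is another sentence. Salary is $15.00 per hours. Another",
  "This is a sentence. Retention bonus of $5,000 offered! This is another sentence. Salary is $15.00 per hours? Another",
  "This is a sentence\n \n$5k incentive offered\n \nThis is another sentence\n Salary is $15.00 per hours\nAnother"
)

# Split after ".", "!" or "?" when followed by whitespace, or at any line break.
# The "." in "$15.00" is not followed by whitespace, so it does not split.
parts <- unlist(strsplit(text, "(?<=[.!?])\\s+|[\r\n]+", perl = TRUE))

# Trim surrounding whitespace, then drop trailing sentence punctuation
parts <- sub("[.!?]+$", "", trimws(parts))

# Keep only the sentences that mention either keyword, ignoring case
hits <- grep("bonus|incentive", parts, value = TRUE, ignore.case = TRUE)
print(hits)
```

Like the grep call above, this returns every matching sentence across all rows, so a row with two matching sentences contributes two elements.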
CodePudding user response:
Maybe something like this:
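library(tidyverse)
tibble(text) %>%
  # split each row into one sentence per row at ".", "!" or a newline
  separate_rows(text, sep = '\\.|\n|\\!') %>%
  # trim and collapse whitespace
  mutate(text = str_squish(text)) %>%
  # keep non-empty sentences that mention either keyword (case-insensitive)
  filter(text != "" & (str_detect(text, fixed("bonus", ignore_case = TRUE)) |
    str_detect(text, fixed("incentive", ignore_case = TRUE))))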
text
<chr>
1 $5k SIGN-ON BONUS offered
2 Retention bonus of $5,000 offered
3 $5k incentive offered
4 $5000 sign-on Bonus offered
5 Retention bonus of $5000 offered
6 $5k incentive offered
7 $5k signing bonus offered
8 $5k incentive offered
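If one result per input row is wanted, as in the question's desired output, a single str_extract call may also work. This is only a sketch, assuming that no sentence of interest contains ".", "!" or "?" internally (a decimal amount such as "$5.5k" inside the bonus sentence would cut the match short); the shortened sample vector and the names pattern and result are illustrative:

```r
library(stringr)

text <- c(
  "This is a sentence. $5k SIGN-ON BONUS offered. This is another sentence.",
  "This is a sentence\n \n$5000 sign-on Bonus offered\n \nThis is another sentence",
  "This is a sentence.\n$5k incentive offered!\nSalary is $15.00 per hours? Another"
)

# Match a run of characters that contains a keyword and no sentence delimiter
# (".", "!", "?", carriage return or newline) on either side of it
pattern <- regex("[^.!?\\r\\n]*(?:bonus|incentive)[^.!?\\r\\n]*", ignore_case = TRUE)

# str_extract returns the first match per element; str_squish trims the
# leading space left over from the preceding sentence
result <- str_squish(str_extract(text, pattern))
print(result)
```

str_extract gives NA for rows with no match, and str_extract_all could be swapped in if a row may contain more than one matching sentence.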