I want to extract, from sentences, a fixed phrase (matched with a regex) together with the date that follows it.
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."
The pattern is "number of subscribers", followed by a date in Month Day, Year format. Sometimes there is "as of" or "in" between the pattern and the date, and sometimes there is nothing at all.
I have tried the following script.
library(stringr)

find_dates <- function(text){
  pattern <- "\\bnumber\\s+of\\s+subscribers\\s+(\\S+(?:\\s+\\S+){3})" # phrase and the next 4 words
  str_extract(text, pattern)
}
However, this extracts the in-between words too, which I would like to ignore.
Desired output:
find_dates(text1)
'number of subscribers December 31, 2022'
find_dates(text2)
'number of subscribers January 10, 2023'
Answer:
An approach using stringr
library(stringr)
find_Dates <- function(x) paste0(str_extract_all(x,
  "\\bnumber\\b (\\b\\S+\\b ){2}|\\b\\S+\\b \\d{2}, \\d{4}")[[1]], collapse="")
find_Dates(text1)
[1] "number of subscribers December 31, 2022"
# all texts
lapply(c(text1, text2), find_Dates)
[[1]]
[1] "number of subscribers December 31, 2022"
[[2]]
[1] "number of subscribers January 10, 2023"
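To see why this works, it can help to look at the pieces before they are pasted together. The following is only a small sketch, assuming the pattern is written with `+` quantifiers after each `\\S`; the variable names `pattern`, `pieces` and `combined` are introduced here for illustration.

```r
library(stringr)

text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."

# Two alternatives: the phrase (with its trailing space) OR a "word dd, yyyy" date.
pattern <- "\\bnumber\\b (\\b\\S+\\b ){2}|\\b\\S+\\b \\d{2}, \\d{4}"

# str_extract_all() returns every match; [[1]] takes the matches for the first string.
pieces <- str_extract_all(text1, pattern)[[1]]
print(pieces)
# [1] "number of subscribers " "December 31, 2022"

# The in-between words ("as of") match neither alternative, so they are dropped.
combined <- paste0(pieces, collapse = "")
print(combined)
# [1] "number of subscribers December 31, 2022"
```

Note that `\\d{2}` expects a two-digit day, so a date such as "January 5, 2023" would presumably need `\\d{1,2}` instead.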
Answer:
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."
library(stringr)

find_dates <- function(text){
  # Group 1: the phrase. An optional "as of" or "in" is skipped. Group 2: the next 3 words (the date).
  pattern <- "(\\bnumber\\s+of\\s+subscribers)\\s+(?:as\\s+of\\s+|in\\s+)?(\\S+(?:\\s+\\S+){2})"
  str_extract(text, pattern, group = 1:2) # the `group` argument requires stringr >= 1.5.0
}
find_dates(text1)
# [1] "number of subscribers" "December 31, 2022"
find_dates(text2)
# [1] "number of subscribers" "January 10, 2023"
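If a single string is wanted, as in the desired output of the question, the two pieces can be pasted together. The sketch below is one possible variant in base R only, not the method of either answer. The helper name `find_dates_base` is hypothetical, and it assumes full English month names (from the built-in `month.name`) and that whatever sits between the phrase and the date can be skipped lazily.

```r
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."

# Hypothetical helper (not from the answers above); uses base R only.
find_dates_base <- function(text) {
  months <- paste(month.name, collapse = "|")  # "January|February|...|December"
  pattern <- paste0(
    "\\b(number\\s+of\\s+subscribers)\\b",              # group 1: the phrase
    ".*?",                                              # lazily skip in-between text
    "\\b((?:", months, ")\\s+\\d{1,2},\\s+\\d{4})\\b"   # group 2: Month Day, Year
  )
  # regexec() keeps capture groups; each element is c(full match, group 1, group 2).
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))
  vapply(
    m,
    function(x) if (length(x) == 3) paste(x[2], x[3]) else NA_character_,
    character(1)
  )
}

find_dates_base(c(text1, text2))
# [1] "number of subscribers December 31, 2022"
# [2] "number of subscribers January 10, 2023"
```

Because `.*?` skips any text, this would also pick up a date that appears much later in the sentence; restricting the gap to `as of` or `in` is stricter if that matters.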