How to extract patterns along with dates in string using R?

Time:01-19

I want to extract a phrase matched by a regex pattern, together with the date that follows it, from sentences like these:

text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."

text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."

The pattern is "number of subscribers", followed by a date in Month Day, Year format. Sometimes "as of" or "in" sits between the pattern and the date, and sometimes there is nothing in between.

I have tried the following script.

library(stringr)

find_dates <- function(text){

  pattern <- "\\bnumber\\s+of\\s+subscribers\\s+(\\S+(?:\\s+\\S+){3})" # pattern and next 3 words

  str_extract(text, pattern)

}

However, this extracts the in-between words too, which I would like to ignore.

Desired output:

find_dates(text1)

'number of subscribers December 31, 2022'

find_dates(text2)

'number of subscribers January 10, 2023'

CodePudding user response:

An approach using stringr

library(stringr)

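# First alternative matches "number" plus the next two words;
# second alternative matches "Month DD, YYYY". Both pieces are pasted together.
find_Dates <- function(x) paste0(str_extract_all(x, 
  "\\bnumber\\b (\\b\\S+\\b ){2}|\\b\\S+\\b \\d{2}, \\d{4}")[[1]], collapse="")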
find_Dates(text1)
[1] "number of subscribers December 31, 2022"

# all texts
lapply(c(text1, text2), find_Dates)
[[1]]
[1] "number of subscribers December 31, 2022"

[[2]]
[1] "number of subscribers January 10, 2023"

CodePudding user response:

text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."


library(stringr)

find_dates <- function(text){
  # pattern <- "(\\bnumber\\s+of\\s+subscribers)\\s+(\\S+(?:\\s+\\S+){3})" # pattern and next 3 words
  # group 1: the phrase; optional non-capturing filler ("as of" / "in"); group 2: the next 3 words
  pattern <- "(\\bnumber\\s+of\\s+subscribers)(?:\\s+as\\s+of\\s|\\s+in\\s+)?(\\S+(\\s+\\S+){2})"
  # the `group` argument needs stringr >= 1.5.0
  str_extract(text, pattern, group = 1:2)

}

find_dates(text1)
# [1] "number of subscribers" "December 31, 2022"    

find_dates(text2)
# [1] "number of subscribers" "January 10, 2023"
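
If the filler between the phrase and the date is not known in advance (or is absent), one option is to anchor on a spelled-out month name and lazily skip a small number of words before it. Below is a minimal sketch in base R, so it runs without extra packages. The limit of two filler words, the use of the built-in month.name constant, the name find_dates_base, and the example text3 are assumptions made for illustration; they are not part of the question.

```r
text1 <- "The number of subscribers as of December 31, 2022 of Netflix increased by 1.15% compared to the previous year."
text2 <- "Netflix's number of subscribers in January 10, 2023 has grown more than 1.50%."
# Hypothetical sentence with nothing between the phrase and the date
text3 <- "The number of subscribers March 5, 2021 was lower."

# Spelled-out month names, so filler words can never be mistaken for a date
months <- paste(month.name, collapse = "|")

# Group 1: the fixed phrase.
# (?:\s+\S+){0,2}? lazily skips up to two filler words (assumed limit).
# Group 2: "Month Day, Year" with a one- or two-digit day.
pattern <- paste0(
  "\\b(number\\s+of\\s+subscribers)(?:\\s+\\S+){0,2}?\\s+",
  "((?:", months, ")\\s+\\d{1,2},\\s+\\d{4})"
)

find_dates_base <- function(text) {
  # regexec() returns the full match followed by each capture group
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  if (length(m) == 0) return(NA_character_)  # no phrase + date found
  paste(m[2], m[3])
}

find_dates_base(text1)
# [1] "number of subscribers December 31, 2022"
find_dates_base(text2)
# [1] "number of subscribers January 10, 2023"
find_dates_base(text3)
# [1] "number of subscribers March 5, 2021"
```

Sentences where the date is more than two words away from the phrase return NA with this sketch; raising the upper bound in {0,2} relaxes that at the cost of possibly picking up an unrelated date.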