Optional pattern part in regex lookbehind

Time:12-09

In the example below I am trying to extract the text between 'Supreme Court' or 'Supreme Court of the United States' and the next date (including the date). The result below is not what I intended since result 2 includes "of the United States".

I assume the error is due to the .*? part, since . can also match 'of the United States'. Any ideas how to exclude it? More generally, the question is how to include an optional element in a lookbehind (which does not seem to be possible, since ? makes the lookbehind variable-length). Many thanks!

library(tidyverse)
txt <- c("The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also.")

str_extract_all(txt, regex("(?<=Supreme Court)(\\sof the United States)?.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
#> [[1]]
#> [1] " decided on 2 April 2020"                     
#> [2] " of the United States decided on 5 March 2011"

Created on 2021-12-09 by the reprex package (v2.0.1)

I also tried

   str_extract_all(txt, regex("(?<=(Supreme Court)|(Supreme Court of the United States)).*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))

however the result is the same.
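Both attempts give the same result because the lookbehind is already satisfied immediately after the first 'Supreme Court', so the match starts there and .*? then consumes ' of the United States'. One possible way to stay with lookarounds is sketched below, assuming base R with perl = TRUE (PCRE): the lookbehind lists both fixed-length alternatives, and a negative lookahead rejects the start position that is directly followed by ' of the United States'. The variable names are only illustrative.

```r
txt <- "The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also."

# Lookbehind with two fixed-length alternatives (allowed in PCRE),
# followed by a negative lookahead that skips the position right
# before " of the United States".
pattern <- paste0(
  "(?<=Supreme Court|Supreme Court of the United States)",
  "(?! of the United States)",
  ".*?\\d{1,2}\\s\\w+\\s\\d{2,4}"
)

res <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
print(res)
```

The leading space is still part of each match here; trimws(res) would remove it if needed.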

CodePudding user response:

You can do this with str_match_all and group capture:

str_match_all(txt, regex("Supreme Court(?:\\sof the United States)?(.*?\\d{1,2}\\s\\w+\\s\\d{2,4})")) %>% 
  .[[1]] %>% .[, 2]

[1] " decided on 2 April 2020" " decided on 5 March 2011"

CodePudding user response:

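In this case, I would prefer the Perl-compatible (PCRE) engine, which is available in base R via perl = TRUE, over the ICU engine that stringr/stringi use. PCRE supports \K, which discards everything matched so far from the reported match, so the optional part does not need to sit inside a lookbehind.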

pattern <- "Supreme Court (of the United States ?)?\\K.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"
regmatches(txt, gregexpr(pattern, txt, perl = TRUE))

[[1]]
[1] "decided on 2 April 2020" "decided on 5 March 2011"
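To make the effect of \K visible, here is a small sketch that runs the same pattern with and without it. It assumes base R with perl = TRUE; the helper variables date, with_k and without_k are made up for this illustration.

```r
txt <- "The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also."

# Shared date part: day, month name, year.
date <- "\\d{1,2}\\s\\w+\\s\\d{2,4}"

# Same prefix in both patterns; only the \K differs.
with_k    <- paste0("Supreme Court (?:of the United States )?\\K.*?", date)
without_k <- paste0("Supreme Court (?:of the United States )?.*?", date)

res_with    <- regmatches(txt, gregexpr(with_k, txt, perl = TRUE))[[1]]
res_without <- regmatches(txt, gregexpr(without_k, txt, perl = TRUE))[[1]]

print(res_with)     # prefix dropped from the reported match
print(res_without)  # prefix kept in the reported match
```

Without \K the whole prefix is returned as part of each match; with \K only the text after the (optional) prefix is reported.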