In the example below I am trying to extract the text between 'Supreme Court' or 'Supreme Court of the United States' and the next date (including the date). The result below is not what I intended since result 2 includes "of the United States".
I assume the error is due to the .*? part, since . can also match 'of the United States'. Any ideas how to exclude it?
I guess, more generally speaking, the question is how to include an optional 'element' in a lookbehind (which seems not to be possible, since ? makes it a non-fixed-length input).
Many thanks!
library(tidyverse)
txt <- c("The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also.")
str_extract_all(txt, regex("(?<=Supreme Court)(\\sof the United States)?.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
#> [[1]]
#> [1] " decided on 2 April 2020"
#> [2] " of the United States decided on 5 March 2011"
Created on 2021-12-09 by the reprex package (v2.0.1)
I also tried
str_extract_all(txt, regex("(?<=(Supreme Court)|(Supreme Court of the United States)).*?\\d{1,2}\\s\\w+\\s\\d{2,4}"))
however the result is the same.
Answer 1:
You can do this with str_match_all and group capture:
str_match_all(txt, regex("Supreme Court(?:\\sof the United States)?(.*?\\d{1,2}\\s\\w+\\s\\d{2,4})")) %>%
  .[[1]] %>% .[, 2]
[1] " decided on 2 April 2020" " decided on 5 March 2011"
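To see why the capture group helps, it may be useful to run both variants side by side. The following is only a minimal sketch, assuming stringr is installed; date_part, lookbehind and captured are names introduced here for illustration. The lookbehind is already satisfied right after "Supreme Court", so the lazy .*? simply starts there and swallows the optional phrase. In the capture-group version the optional phrase is consumed outside the parentheses, so it never ends up in the returned column.

```r
library(stringr)

txt <- "The US Supreme Court decided on 2 April 2020 The Supreme Court of the United States decided on 5 March 2011 also."

# Illustrative helper string: day, month name, year.
date_part <- "\\d{1,2}\\s\\w+\\s\\d{2,4}"

# Lookbehind variant: the match starts directly after "Supreme Court",
# so the second result still contains " of the United States".
lookbehind <- str_extract_all(txt, paste0("(?<=Supreme Court).*?", date_part))[[1]]

# Capture-group variant: the optional phrase sits in a non-capturing group
# before the capture, so column 2 of the match matrix holds only the wanted text.
captured <- str_match_all(
  txt,
  paste0("Supreme Court(?:\\sof the United States)?(.*?", date_part, ")")
)[[1]][, 2]

print(lookbehind)
print(captured)
```

The trade-off is that str_match_all returns a matrix per input string (full match in column 1, groups after that), so one extra indexing step is needed compared with str_extract_all.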
Answer 2:
In this case, I would prefer the perl engine, which is available in base R via perl = TRUE, rather than the ICU-library engine that stringr/stringi uses.
pattern <- "Supreme Court (of the United States ?)?\\K.*?\\d{1,2}\\s\\w+\\s\\d{2,4}"
regmatches(txt, gregexpr(pattern, txt, perl = TRUE))
[[1]]
[1] "decided on 2 April 2020" "decided on 5 March 2011"
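The piece doing the work in that pattern is \K, which tells the PCRE engine to drop everything matched so far from the reported match. That is what lets the optional phrase be consumed without showing up in the result. A minimal sketch on a shorter input, using base R only; x, pattern and kept are illustrative names, and the non-capturing group plus \\s* is just one possible way to write the optional part:

```r
# Two inputs: one with and one without the optional phrase.
x <- c("Supreme Court ruled", "Supreme Court of the United States ruled")

# Everything before \K must match, but is not part of the returned text.
pattern <- "Supreme Court(?: of the United States)?\\s*\\K\\w+"

# regexpr() finds the first match per element; perl = TRUE is required for \K.
kept <- regmatches(x, regexpr(pattern, x, perl = TRUE))

print(kept)
#> [1] "ruled" "ruled"
```

Because the prefix before \K is an ordinary part of the pattern, it can be of any length, which sidesteps the lookbehind restrictions altogether.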