I need to extract numbers followed by an A, up to the point where the pattern " X " appears:
"50A ABC DE 51A FG X 52A HI 53A"
The regex \d+A(?=.* X ) correctly matches 50A and 51A, because they appear before X.
However, if a string does not contain the X pattern, the regex won't match any of the desired patterns (50A, 51A, 52A and 53A):
"50A ABC DE 51A FG 52A HI 53A" # no X here
How do I fix that?
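For reference, the behaviour described above can be reproduced with a minimal base R sketch. The helper name extract_lookahead is hypothetical and only introduced for illustration; it assumes PCRE is enabled via perl = TRUE so that the lookahead is supported.

```r
# Hypothetical helper: applies the lookahead regex from the question (PCRE mode)
extract_lookahead <- function(x) {
  regmatches(x, gregexpr("\\d+A(?=.* X )", x, perl = TRUE))[[1]]
}

with_x    <- "50A ABC DE 51A FG X 52A HI 53A"
without_x <- "50A ABC DE 51A FG 52A HI 53A"

extract_lookahead(with_x)     # "50A" "51A"
extract_lookahead(without_x)  # character(0): the lookahead needs " X " to exist
```

The second call returns nothing because the lookahead requires " X " somewhere to the right of every match, so it can never succeed when X is absent.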
CodePudding user response:
You may use a PCRE regex like
\G(?:(?! X ).)*?\K\b\d+A\b
Details:
\G - start of string or end of the previous successful match (to ensure only consecutive matches)
(?:(?! X ).)*? - any char other than a line break char, as few as possible, that does not start a space + X + space char sequence
\K - a match reset operator that discards all text matched so far
\b\d+A\b - one or more digits and A inside word boundaries.
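To see why the \G anchor matters, here is a small sketch comparing the pattern with and without it. The variable and function names (rx_anchored, rx_unanchored, extract) are hypothetical and only used for this illustration.

```r
# Same pattern with and without the leading \G anchor (PCRE, perl = TRUE)
rx_anchored   <- "\\G(?:(?! X ).)*?\\K\\b\\d+A\\b"
rx_unanchored <- "(?:(?! X ).)*?\\K\\b\\d+A\\b"

# Hypothetical helper returning all matches of rx in a single string
extract <- function(rx, x) regmatches(x, gregexpr(rx, x, perl = TRUE))[[1]]

x <- "50A ABC DE 51A FG X 52A HI 53A"

extract(rx_anchored, x)    # "50A" "51A"
extract(rx_unanchored, x)  # "50A" "51A" "52A" "53A"
```

Without \G the engine is free to restart the search after " X ", so 52A and 53A are found again. With \G every match has to begin exactly where the previous one ended, and the tempered token (?:(?! X ).)*? cannot step across " X ".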
In R, you can use the following base R code:
x <- "50A ABC DE 51A FG 52A HI 53A"
rx <- "\\G(?:(?! X ).)*?\\K\\b\\d+A\\b"
regmatches(x, gregexpr(rx, x, perl=TRUE))
# => [[1]]
# [1] "50A" "51A" "52A" "53A"
x <- "50A ABC DE 51A FG X 52A HI 53A"
regmatches(x, gregexpr(rx, x, perl=TRUE))
# => [[1]]
# [1] "50A" "51A"
Alternatively, you can remove everything from the X word onward, and then extract:
x <- "50A ABC DE 51A FG X 52A HI 53A"
library(stringr)
str_extract_all(sub("(\\s|^)X(\\s.*)?$", "", x), "\\b\\d+A\\b")
# => [[1]]
# [1] "50A" "51A"
x <- "50A ABC DE 51A FG 52A HI 53A"
str_extract_all(sub("(\\s|^)X(\\s.*)?$", "", x), "\\b\\d+A\\b")
# => [[1]]
# [1] "50A" "51A" "52A" "53A"
Here,
sub("(\\s|^)X(\\s.*)?$", "", x) - removes X at the start of the string or after a whitespace (together with that whitespace), optionally followed by whitespace and any text up to the end of the string
str_extract_all(..., "\\b\\d+A\\b") - extracts one or more digits followed by A as whole words from the remaining part of the string.
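If loading stringr is not desirable, the same two-step idea can be sketched in base R only. The function name extract_until_x is hypothetical; the regexes are the ones described above.

```r
# Hypothetical base R helper: trim from the X word onward, then extract
extract_until_x <- function(x) {
  # Step 1: drop X (with its leading whitespace) and everything after it;
  # the string is left unchanged when no standalone X is present
  trimmed <- sub("(\\s|^)X(\\s.*)?$", "", x)
  # Step 2: extract whole-word "digits + A" tokens from what remains
  regmatches(trimmed, gregexpr("\\b\\d+A\\b", trimmed, perl = TRUE))[[1]]
}

extract_until_x("50A ABC DE 51A FG X 52A HI 53A")  # "50A" "51A"
extract_until_x("50A ABC DE 51A FG 52A HI 53A")    # "50A" "51A" "52A" "53A"
```

Unlike str_extract_all, which returns a list, this sketch returns a plain character vector for a single input string.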
CodePudding user response:
Another option is to match X and, from that point on, avoid matching the rest of the line using SKIP FAIL, with PCRE enabled via perl=TRUE:
X .*(*SKIP)(*F)|\b\d+A\b
The pattern matches:
X followed by a space - match literally
.*(*SKIP)(*F) - match the rest of the line, then skip it so that nothing in it can be matched
| - or
\b\d+A\b - match 1+ digits and A between word boundaries
Example
library(stringr)
s1 <- "50A ABC DE 51A FG X 52A HI 53A"
s2 <- "50A ABC DE 51A FG 52A HI 53A"
patt <- "X .*(*SKIP)(*F)|\\b\\d+A\\b"
regmatches(s1, gregexpr(patt, s1, perl=TRUE))
regmatches(s2, gregexpr(patt, s2, perl=TRUE))
Output
[[1]]
[1] "50A" "51A"
[[1]]
[1] "50A" "51A" "52A" "53A"