How do I match a string until certain pattern that also works when the pattern does not show up?

Time:11-25

I need to extract numbers followed by an A, until pattern " X " appears:

"50A ABC DE 51A FG X 52A HI 53A"

The regex \d+A(?=.* X ) correctly matches 50A and 51A, because they appear before X.

However, if a string does not contain the X pattern, the regex won't match any of the desired patterns (50A, 51A, 52A and 53A):

"50A ABC DE 51A FG 52A HI 53A"    # no X here

How do I fix that?
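
The behavior described above can be reproduced with a short base R sketch. It assumes PCRE matching via perl = TRUE (needed for the lookahead); the variable names are only illustrative:

```r
# Lookahead version from the question: a number followed by A,
# but only if " X " occurs somewhere later in the string.
rx_lookahead <- "\\d+A(?=.* X )"

with_x    <- "50A ABC DE 51A FG X 52A HI 53A"
without_x <- "50A ABC DE 51A FG 52A HI 53A"

m_with    <- regmatches(with_x, gregexpr(rx_lookahead, with_x, perl = TRUE))[[1]]
m_without <- regmatches(without_x, gregexpr(rx_lookahead, without_x, perl = TRUE))[[1]]

print(m_with)     # "50A" "51A"
print(m_without)  # character(0): the lookahead can never succeed without " X "
```

The lookahead makes " X " mandatory, which is why nothing matches when it is absent.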

CodePudding user response:

You may use a PCRE regex like

\G(?:(?! X ).)*?\K\b\d+A\b

Details:

  • \G - start of string or end of the previous successful match (this ensures that only consecutive matches are found)
  • (?:(?! X ).)*? - any character other than a line break, as few times as possible, that does not start a "space, X, space" sequence
  • \K - a match reset operator that discards all text matched so far
  • \b\d+A\b - one or more digits followed by A, inside word boundaries.

In R, you can use the following base R code:

x <- "50A ABC DE 51A FG 52A HI 53A"
rx <- "\\G(?:(?! X ).)*?\\K\\b\\d+A\\b"
regmatches(x, gregexpr(rx, x, perl=TRUE))
# => [[1]]
#    [1] "50A" "51A" "52A" "53A"
x <- "50A ABC DE 51A FG X 52A HI 53A"
regmatches(x, gregexpr(rx, x, perl=TRUE))
# => [[1]]
#    [1] "50A" "51A"

Alternatively, you can remove everything from the X word onward, and then extract:

x <- "50A ABC DE 51A FG X 52A HI 53A"
library(stringr)
str_extract_all(sub("(\\s|^)X(\\s.*)?$", "", x), "\\b\\d+A\\b")
# => [[1]]
#    [1] "50A" "51A"

x <- "50A ABC DE 51A FG 52A HI 53A"
str_extract_all(sub("(\\s|^)X(\\s.*)?$", "", x), "\\b\\d+A\\b")
# => [[1]]
#    [1] "50A" "51A" "52A" "53A"

Here,

  • sub("(\\s|^)X(\\s.*)?$", "", x) removes an X that is at the start of the string or preceded by a whitespace (together with that whitespace), plus any whitespace and text that follow it up to the end of the string
  • str_extract_all(..., "\\b\\d+A\\b") extracts one or more digits followed by A, as whole words, from the remaining part of the string.
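
The two steps above can also be sketched in base R only, which makes the intermediate truncated string visible. This is a minimal sketch: the helper name extract_before_x is hypothetical, and regmatches/gregexpr stand in for str_extract_all:

```r
# Hypothetical helper: cut the string at the X word, then extract the codes.
extract_before_x <- function(x) {
  # Step 1: drop X (with the whitespace before it) and everything after it.
  truncated <- sub("(\\s|^)X(\\s.*)?$", "", x)
  # Step 2: extract digits followed by A as whole words from what is left.
  regmatches(truncated, gregexpr("\\b\\d+A\\b", truncated, perl = TRUE))[[1]]
}

# Intermediate result of step 1 for the string that contains X
truncated_demo <- sub("(\\s|^)X(\\s.*)?$", "", "50A ABC DE 51A FG X 52A HI 53A")
print(truncated_demo)  # "50A ABC DE 51A FG"

res_with_x    <- extract_before_x("50A ABC DE 51A FG X 52A HI 53A")
res_without_x <- extract_before_x("50A ABC DE 51A FG 52A HI 53A")
print(res_with_x)      # "50A" "51A"
print(res_without_x)   # "50A" "51A" "52A" "53A"
```

When there is no X, sub() finds nothing to replace and returns the string unchanged, so all four codes are extracted.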

CodePudding user response:

Another option is to match X and, from that point on, avoid matching the rest of the line using (*SKIP)(*F), with PCRE enabled through perl=TRUE:

X .*(*SKIP)(*F)|\b\d+A\b

The pattern matches:

  • X Match literally
  • .*(*SKIP)(*F) Match the rest of the line to avoid matching it
  • | Or
  • \b\d+A\b Match 1 or more digits and A between word boundaries

Example

s1 <- "50A ABC DE 51A FG X 52A HI 53A"   # contains X
s2 <- "50A ABC DE 51A FG 52A HI 53A"     # no X
patt <- "X .*(*SKIP)(*F)|\\b\\d+A\\b"

# perl=TRUE is required for the (*SKIP)(*F) verbs
regmatches(s1, gregexpr(patt, s1, perl=TRUE))
regmatches(s2, gregexpr(patt, s2, perl=TRUE))

Output

[[1]]
[1] "50A" "51A"

[[1]]
[1] "50A" "51A" "52A" "53A"