RegEx syntax for selecting the second occurrence of characters

Time:06-13

I have a relatively simple problem but can't figure out the right syntax in RegEx. I have multiple experiment names as strings in various formats, e.g. SEF001DT45 or BV004MF.

What I want to do is select the second occurrence of two letters, i.e. the two letters that follow a numeric value (DT and MF in this case).

I figured out that [A-Z]{2} solves my problem only halfway. How do I get the proper substrings?

CodePudding user response:

A possible solution, based on stringr::str_extract and lookaround:

library(stringr)

strings <- c("SEF001DT45", "BV004MF")

str_extract(strings, "(?<=\\d)[:upper:]{2}")

#> [1] "DT" "MF"

CodePudding user response:

TLDR: Generally, you can get the second occurrence of a PATTERN using one of the following

sub('.*?PATTERN.*?(PATTERN).*', '\\1', x)
stringr::str_match(x, 'PATTERN.*?(PATTERN)')[,2]
regmatches(x, regexpr('PATTERN.*?\\KPATTERN', x, perl=TRUE))
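The one-liners above can be wrapped in a small helper. This is only a sketch: `second_match` is a hypothetical name (not part of base R or stringr), it builds on the \K variant, and it returns NA where a string has fewer than two occurrences. The third input string is a made-up example.

```r
# Hypothetical helper: second occurrence of `pattern` in each element of `x`,
# or NA where there are fewer than two occurrences.
second_match <- function(x, pattern) {
  # First occurrence, lazy skip, then \K drops everything matched so far
  rx <- paste0("(?:", pattern, ").*?\\K(?:", pattern, ")")
  m <- regexpr(rx, x, perl = TRUE)
  hit <- !is.na(m) & m > 0          # regexpr returns -1 where nothing matched
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)      # regmatches returns only the actual matches
  out
}

second_match(c("SEF001DT45", "BV004MF", "AB12"), "[A-Z]{2}")
#> [1] "DT" "MF" NA
```

Wrapping the pattern in (?:...) keeps an alternation inside PATTERN from leaking into the surrounding expression.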

Details

You can use

x <- c('SEF001DT45','BV004MF')
sub('.*?[A-Z]{2}.*?([A-Z]{2}).*', '\\1', x)
## => [1] "DT" "MF"

The idea is to match everything up to the second occurrence of the pattern, capture that occurrence, match the rest of the string, and replace the whole match with the backreference to the capturing group.

Note that sub performs a single search-and-replace operation, which is fine here because the regex matches the whole string.

Details:

  • .*? - any zero or more chars as few as possible
  • [A-Z]{2} - two uppercase ASCII letters
  • .*? - any zero or more chars as few as possible
  • ([A-Z]{2}) - Group 1 (\1 refers to this group value): two uppercase ASCII letters
  • .* - any zero or more chars as many as possible.
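One caveat with the sub approach: when a string contains fewer than two occurrences, the pattern does not match and sub returns that string unchanged. A minimal sketch of guarding against this with grepl, where "AB12" is a made-up non-matching input:

```r
x <- c("SEF001DT45", "BV004MF", "AB12")  # "AB12" has only one two-letter chunk
rx <- ".*?[A-Z]{2}.*?([A-Z]{2}).*"

# Without a guard, the non-matching element comes back as-is
sub(rx, "\\1", x)

# Replace only where the pattern matches; use NA elsewhere
res <- ifelse(grepl(rx, x), sub(rx, "\\1", x), NA_character_)
res
#> [1] "DT" "MF" NA
```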

You can achieve this with a simpler regex using stringr::str_match:

x <- c('SEF001DT45','BV004MF')
library(stringr)
results <- stringr::str_match(x, '[A-Z]{2}.*?([A-Z]{2})')
results[,2] ## Get Group 1 values

Or, with regmatches/regexpr in base R:

x <- c('SEF001DT45','BV004MF')
results <- regmatches(x, regexpr('[A-Z]{2}.*?\\K[A-Z]{2}', x, perl=TRUE))
results

Here, [A-Z]{2}.*?\\K[A-Z]{2} first matches two uppercase ASCII letters, then any zero or more chars (other than line break chars, since the PCRE engine is used) as few as possible. \K then discards the text matched so far, so only the final [A-Z]{2}, the second occurrence of the two-letter chunk, is returned. regexpr only finds the first match in each string.
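Because regexpr stops at the first match, another route is gregexpr, which returns every non-overlapping match, followed by picking the second one. This sketch swaps in gregexpr rather than the \K pattern; "AB12" is a made-up input with only one match.

```r
x <- c("SEF001DT45", "BV004MF", "AB12")

# All non-overlapping two-letter chunks per string, as a list of character vectors
all_hits <- regmatches(x, gregexpr("[A-Z]{2}", x))

# Take the second chunk from each; indexing past the end yields NA
second <- vapply(all_hits, function(h) h[2], character(1))
second
#> [1] "DT" "MF" NA
```

Unlike regmatches with regexpr, which silently drops strings without a match, this keeps the result the same length as the input.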

CodePudding user response:

Maybe:

s <- c("SEF001DT45", "BV004MF")
sub("[A-Z]+\\d+([A-Z]{2}).*", "\\1", s)
#sub("[A-Z]+[0-9]+([A-Z]{2}).*", "\\1", s) #Alternative
#[1] "DT" "MF"

Here [A-Z]+ matches one or more uppercase letters, \\d+ one or more digits, ([A-Z]{2}) the two letters to keep, and .* the rest of the string.
The parentheses capture the part that \\1 inserts as the replacement.
Or, to be stricter about the second occurrence of two letters:

sub(".*?[A-Z]{2}[0-9]+([A-Z]{2}).*", "\\1", s)
#[1] "DT" "MF"

If only the two letters right after the first number need to be extracted, this is enough:

regmatches(s, regexpr("(?<=\\d)[A-Z]{2}", s, perl=TRUE))
#[1] "DT" "MF"

CodePudding user response:

Base R:

# Input data:
x <- c(
  'SEF001DT45',
  'BV004MF'
)

# Using capture groups:
gsub(
  ".*\\d{2}(\\w{2}).*",
  "\\1",
  x
)

CodePudding user response:

Another base R trick is strsplit

> sapply(strsplit(s, split = "\\d+"), `[[`, 2)
[1] "DT" "MF"

or gsub

> gsub("^.*?(?<=\\d)(\\D+).*", "\\1", s, perl = TRUE)
[1] "DT" "MF"