R stringr regex to extract characters within brackets

Time:12-15

I'm trying to use regex in R to extract the entire string within brackets, where the brackets contain a keyword:

library(stringr)
test <- "asdf asiodjfojewl kjwnkjwnefkjnkf [asdf] fasdfads fewrw [keyword<1] keyword [keyword>1]"

Should return

keyword<1 # fine if it returns [keyword<1] with the brackets too instead
keyword>1

My attempt returns all of the letters individually and excludes the number from the brackets.

# my attempt
str_extract_all(test, regex("[\\<keyword\\>.*?]"))
[[1]]
 [1] "d" "o" "d" "o" "e" "w" "k" "w" "k" "w" "e" "k" "k" "d" "d" "d" "e" "w" "r" "w" "k" "e" "y" "w" "o" "r" "d" "<" "k" "e" "y" "w" "o" "r"
[35] "d" "k" "e" "y" "w" "o" "r" "d" ">"

User response:

This wraps test as ]test[ using sprintf and then splits the result at every stretch that starts with a ] and runs up to the next [. In the strsplit regex, ] matches itself and .*?\[ matches the shortest string up to and including the next [. The result is a list with one component for each element of test (in case test is a character vector), and grep then keeps only the pieces that contain a < or >. No packages are used.

test |>
  sprintf(fmt = "]%s[") |>
  strsplit("].*?\\[") |>
  lapply(grep, pattern = "[<>]", value = TRUE)
## [[1]]
## [1] "keyword<1" "keyword>1"
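
The pipeline above can also be run one step at a time to inspect the intermediate values. This is a minimal sketch using the same base R calls; the names wrapped, pieces and result are introduced here only for illustration.

```r
test <- "asdf asiodjfojewl kjwnkjwnefkjnkf [asdf] fasdfads fewrw [keyword<1] keyword [keyword>1]"

# Step 1: wrap the input so that it starts with "]" and ends with "[".
wrapped <- sprintf("]%s[", test)

# Step 2: split on "]" followed by the shortest run of characters
# up to and including the next "[". What remains between the
# separators is the content of each bracket pair.
pieces <- strsplit(wrapped, "].*?\\[")
print(pieces)

# Step 3: within each list component, keep only the pieces that
# contain "<" or ">".
result <- lapply(pieces, grep, pattern = "[<>]", value = TRUE)
print(result)
```

Note that the last step filters on < or > rather than on the word keyword itself; passing pattern = "keyword" to grep instead should select the same two pieces for this input.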

User response:

You can use

library(stringr)
test <- "asdf asiodjfojewl kjwnkjwnefkjnkf [asdf] fasdfads fewrw [keyword<1] keyword [keyword>1]"
## If the word is right after "[":
str_extract_all(test, "(?<=\\[)keyword[^\\]\\[]*(?=])")
## If the word is anywhere between "[" and "]":
str_extract_all(test, "(?<=\\[)[^\\]\\[]*?keyword[^\\]\\[]*(?=])")
## =>
# [[1]]
# [1] "keyword<1" "keyword>1"

The regexps match:

  • (?<=\[) - a positive lookbehind that requires a [ char to appear immediately to the left of the current location
  • [^\]\[]*? - (second pattern only) zero or more chars other than [ and ], as few as possible
  • keyword - a literal string
  • [^\]\[]* - zero or more chars other than [ and ], as many as possible
  • (?=]) - a positive lookahead that requires a ] char to appear immediately to the right of the current location.
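
To see what the lookarounds contribute, the same pattern can be compared with a variant that drops them. This is a minimal sketch that swaps stringr for base R's regmatches/gregexpr with perl = TRUE, on the assumption that PCRE treats these particular constructs the same way as stringr's ICU engine; extract_all is a hypothetical helper name, not part of any package.

```r
test <- "asdf asiodjfojewl kjwnkjwnefkjnkf [asdf] fasdfads fewrw [keyword<1] keyword [keyword>1]"

# Hypothetical helper: all matches of a PCRE pattern in each element of x.
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

# With lookarounds: the brackets are required but not part of the match.
inner_only <- extract_all(test, "(?<=\\[)keyword[^\\]\\[]*(?=])")
print(unlist(inner_only))
# [1] "keyword<1" "keyword>1"

# Without lookarounds: the brackets are consumed and returned as well.
with_brackets <- extract_all(test, "\\[keyword[^\\]\\[]*]")
print(unlist(with_brackets))
# [1] "[keyword<1]" "[keyword>1]"
```

In both cases the bare keyword outside the brackets is skipped, because it is not immediately preceded by a [.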
