Set a repeat count in a look-behind pattern in R using str

Context

I have a vector a:

a = c('nameJack\n', 'name Lucy\n', 'name  Rose\n', 'name   Biden\n', 'name    Peter\n')

Question

I want to extract the actual names from a, such as:

[1] "Jack"      "Lucy"     "Rose"     "Biden"     "Peter"

But the characters I extract always contain spaces.

What I've done

I tried:

> str_extract(a, "(?<=name\\s).*(?=\n)")
[1] NA         "Lucy"     " Rose"    "  Biden"  "   Peter"

Then I tried:

> str_extract(a, "(?<=name\\s*).*(?=\n)")
Error in stri_extract_first_regex(string, pattern, opts_regex = opts(pattern)) : 
  Look-Behind pattern matches must have a bounded maximum length. (U_REGEX_LOOK_BEHIND_LIMIT, context=`(?<=name\s*).*(?=
)`)

I also tried:

> str_extract(a, "(?<=name\\s{0,6}).*(?=\n)")
[1] "Jack"      " Lucy"     "  Rose"    "   Biden"  "    Peter"

Answer

Rather than matching the name with ".*", which also picks up the space characters, you could use "\\w+" to match one or more word characters:

library(stringr)

a <- c('nameJack\n', 'name Lucy\n', 'name  Rose\n', 'name   Biden\n', 'name    Peter\n')

str_extract(a, "(?<=name\\s{0,6})\\w+(?=\n)")
#> [1] "Jack"  "Lucy"  "Rose"  "Biden" "Peter"
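
To see why ".*" keeps the spaces, it may help to look at where each match starts. The sketch below uses str_locate(), which is not part of the original answer, to compare the two patterns. Because "\\s{0,6}" is allowed to match zero spaces, the look-behind is already satisfied immediately after "name", so ".*" starts there and swallows the spaces. "\\w+" cannot start on a space, so the match is pushed forward to the first letter.

```r
library(stringr)

a <- c('nameJack\n', 'name Lucy\n', 'name  Rose\n', 'name   Biden\n', 'name    Peter\n')

# Start position of each match with ".*": always right after "name"
dot_start <- str_locate(a, "(?<=name\\s{0,6}).*(?=\n)")[, "start"]
dot_start
#> [1] 5 5 5 5 5

# Start position with "\\w+": the first word character after the spaces
word_start <- str_locate(a, "(?<=name\\s{0,6})\\w+(?=\n)")[, "start"]
word_start
#> [1] 5 6 7 8 9
```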

Or another approach would be to use str_replace() with a capturing group. This frees you from look-behind/look-ahead assertions and leads to a somewhat more readable regex pattern:

str_replace(a, "name\\s*(\\w+)\n", "\\1")
#> [1] "Jack"  "Lucy"  "Rose"  "Biden" "Peter"
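
Two related variants, offered as a sketch rather than as part of the original answer: str_match() returns the capture group directly, and base R's regmatches() with perl = TRUE can use "\\K" to discard the prefix without a look-behind, so the number of spaces does not need an upper bound. The variable names names_match and names_k are made up for illustration.

```r
library(stringr)

a <- c('nameJack\n', 'name Lucy\n', 'name  Rose\n', 'name   Biden\n', 'name    Peter\n')

# str_match() returns a matrix: column 1 is the full match,
# column 2 is the first capture group
names_match <- str_match(a, "name\\s*(\\w+)\n")[, 2]
names_match
#> [1] "Jack"  "Lucy"  "Rose"  "Biden" "Peter"

# Base R with PCRE (perl = TRUE): \\K drops everything matched so far
# from the reported match, so only the name is returned
names_k <- regmatches(a, regexpr("name\\s*\\K\\w+", a, perl = TRUE))
names_k
#> [1] "Jack"  "Lucy"  "Rose"  "Biden" "Peter"
```

One difference worth keeping in mind: str_replace() returns the input unchanged when the pattern does not match, str_match() gives NA for such elements, and regmatches() drops them, so its result can be shorter than a.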