R regex match beginning and middle of a string

Time:04-06

I have a vector of strings:

A <- c("Hello world", "Green 44", "Hot Beer", "Bip 6t")

I want to add an asterisk (*) before and after the first word of each string, like this:

"*Hello* world", "*Green* 44", "*Hot* Beer", "*Bip* 6t"

It makes sense to use str_replace() from stringr. However, I am struggling to write a regex that matches the first word of each string.

The best I managed was:

str_replace(A, "^([A-Z])", "*\\1*")
"*H*ello world", "*G*reen 44", "*H*ot Beer", "*B*ip 6t"

I expected this to be a straightforward task, but I am not getting along with regex.

Thanks!

CodePudding user response:

You were almost there

str_replace(A, "(^.*) ", "*\\1* ")
#> [1] "*Hello* world" "*Green* 44"    "*Hot* Beer"    "*Bip* 6t" 
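One caveat, sketched here with a made-up vector B that is not part of the question: `.*` is greedy, so the pattern above captures everything up to the last space. That is fine when each string contains exactly one space, as in A, but with more spaces it wraps more than the first word. A pattern that stops at the first whitespace character avoids this.

```r
# Hypothetical input with more than one space (not from the question)
B <- c("Hello big world", "Hot Beer")

# Greedy: .* runs up to the last space, so everything before it is wrapped
greedy <- sub("(^.*) ", "*\\1* ", B)

# Stricter: \\S+ stops at the first whitespace character
first_word <- sub("^(\\S+)", "*\\1*", B)

print(greedy)      # "*Hello big* world" "*Hot* Beer"
print(first_word)  # "*Hello* big world" "*Hot* Beer"
```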

CodePudding user response:

You can use

sub("([[:alpha:]]+)", "*\\1*", A)
## => [1] "*Hello* world" "*Green* 44"    "*Hot* Beer"    "*Bip* 6t"     

The stringr equivalent is

library(stringr)
stringr::str_replace(A, "([[:alpha:]]+)", "*\\1*")
stringr::str_replace(A, "(\\p{L}+)", "*\\1*")


The ([[:alpha:]]+) regex matches one or more letters and captures them into Group 1, and the *\1* replacement (written "*\\1*" in an R string) replaces the match with an asterisk, the Group 1 value, and another asterisk.

Note that sub finds and replaces the first match only, so only the first word is affected in each element of the character vector.
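To make the difference concrete, here is a small base R comparison of sub and gsub on the vector from the question. The variable names first_only and every_run are just illustrative.

```r
A <- c("Hello world", "Green 44", "Hot Beer", "Bip 6t")

# sub() replaces only the first run of letters in each element
first_only <- sub("([[:alpha:]]+)", "*\\1*", A)

# gsub() replaces every run of letters, which is not what the question asks for
every_run <- gsub("([[:alpha:]]+)", "*\\1*", A)

print(first_only)  # "*Hello* world" "*Green* 44" "*Hot* Beer" "*Bip* 6t"
print(every_run)   # "*Hello* *world*" "*Green* 44" "*Hot* *Beer*" "*Bip* 6*t*"
```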

Notes

  • If you plan to wrap the word only when it sits exactly at the start of a string (not just the first word found anywhere), add ^ at the start of the pattern (e.g. sub("^([[:alpha:]]+)", "*\\1*", A))
  • If the word is a chunk of non-whitespace chars, use \S+ instead of [[:alpha:]]+ or \p{L}+ (e.g. sub("^(\\S+)", "*\\1*", A))
  • If the word is any chunk of letters, digits or underscores, you can use \w+, i.e. sub("^(\\w+)", "*\\1*", A)
  • If the word is any chunk of letters or digits but not underscores, you can use [[:alnum:]]+, i.e. sub("^([[:alnum:]]+)", "*\\1*", A)
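The variants in the notes only differ on inputs that contain digits, underscores or punctuation. A rough side-by-side in base R, using a made-up vector x (not from the question) chosen to expose those differences:

```r
# Hypothetical inputs: an underscore, a digit plus a hyphen, a leading number
x <- c("foo_bar baz", "x2-y z", "42 apples")

# Letters only, unanchored: wraps the first run of letters wherever it is
alpha_any   <- sub("([[:alpha:]]+)", "*\\1*", x)

# Letters only, anchored: no change when the string does not start with a letter
alpha_start <- sub("^([[:alpha:]]+)", "*\\1*", x)

# Non-whitespace chunk: everything up to the first whitespace character
non_space   <- sub("^(\\S+)", "*\\1*", x)

# Word characters: letters, digits and underscores
word_chars  <- sub("^(\\w+)", "*\\1*", x)

# Letters and digits, but not underscores
alnum_chars <- sub("^([[:alnum:]]+)", "*\\1*", x)

print(alpha_any)    # "*foo*_bar baz" "*x*2-y z"  "42 *apples*"
print(alpha_start)  # "*foo*_bar baz" "*x*2-y z"  "42 apples"
print(non_space)    # "*foo_bar* baz" "*x2-y* z"  "*42* apples"
print(word_chars)   # "*foo_bar* baz" "*x2*-y z"  "*42* apples"
print(alnum_chars)  # "*foo*_bar baz" "*x2*-y z"  "*42* apples"
```

Which one is right depends on what counts as a "word" in your data; for the vector A in the question all of them give the same result.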