gsub regex that works for both word boundaries and punctuation


I'm trying to search sentences for both words (case-insensitively) and punctuation symbols. The function below works well for words, but a dot, for example, has to be escaped with \\ to be matched literally. Without the escape it leads to unwanted behavior, as shown below:

fun <- function(text, search) {
  gsub(paste0("\\b(", search, ")\\b"), paste0("<mark>", '\\1', "</mark>"),
       text, ignore.case = TRUE)
}

> fun("this is a test.", ".")
[1] "this<mark> </mark>is<mark> </mark><mark>a</mark><mark> </mark>test<mark>.</mark>"

> fun("(this is a test)", ")")
[1] "(this is a test<mark></mark>"

Expected:

> fun("this is a test.", ".")
[1] "this is a test<mark>.</mark>"

> fun("(this is a test)", ")")
[1] "(this is a test<mark>)</mark>"

What is the best way (a regular expression?) to search for both words and punctuation symbols in a string?

CodePudding user response:

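You need to escape the special regex characters in the search string, and to replace the \b word boundaries with lookarounds that require a word boundary only where the search string starts or ends with a word character.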

See the R code:

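## Escaping function: backslash-escape regex metacharacters in the search string
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

## (?!\\B\\w) and (?<!\\w\\B) act as word boundaries only next to word characters
fun <- function(text, search) {
  gsub(paste0("(?!\\B\\w)(", regex.escape(search), ")(?<!\\w\\B)"), "<mark>\\1</mark>",
       text, ignore.case = TRUE, perl=TRUE)
}
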
fun("this is a test.", ".")
# [1] "this is a test<mark>.</mark>"

fun("(this is a test)", ")")
# [1] "(this is a test<mark>)</mark>"
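
As a rough check of why plain \b is not enough and how the lookarounds behave, here is a minimal, self-contained sketch. The names regex_escape and highlight are hypothetical stand-ins for the two functions above, and including "+" in the escaped character set is an assumption about the intended metacharacter list.

```r
# Escape regex metacharacters so the search term is matched literally.
# Assumption: "+" belongs in this set alongside the other metacharacters.
regex_escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

# Hypothetical helper name; it uses the same lookaround pattern as the answer.
highlight <- function(text, search) {
  pattern <- paste0("(?!\\B\\w)(", regex_escape(search), ")(?<!\\w\\B)")
  gsub(pattern, "<mark>\\1</mark>", text, ignore.case = TRUE, perl = TRUE)
}

# Even an escaped dot cannot satisfy a trailing \b at the end of a sentence,
# because \b needs a word character on exactly one side of the position.
dot_with_b <- grepl("\\b\\.\\b", "this is a test.", perl = TRUE)

# Whole-word search: the "is" inside "This" must not be highlighted.
word_hit <- highlight("This is a test.", "is")

# Punctuation-only search and a search mixing a word with punctuation.
punct_hit <- highlight("this is a test.", ".")
mixed_hit <- highlight("this is a test.", "test.")

print(dot_with_b)
print(word_hit)
print(punct_hit)
print(mixed_hit)
```

The negative lookahead (?!\B\w) only rejects a match that starts with a word character in the middle of a word, and (?<!\w\B) does the same for the end of the match, so punctuation at either edge of the search term is left unconstrained.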