gsub regex to work for both word boundaries and punctuation

Time:05-18

I'm trying to search sentences for both words (case-insensitively) and punctuation symbols. The function below works well for words, but a dot, for example, only works if it is escaped with \\; passed as is, it leads to unwanted behavior, as shown below:

fun <- function(text, search) {
  gsub(paste0("\\b(", search, ")\\b"), paste0("<mark>", '\\1', "</mark>"),
       text, ignore.case = TRUE)
}
> fun("this is a test.", ".")
[1] "this<mark> </mark>is<mark> </mark><mark>a</mark><mark> </mark>test<mark>.</mark>"

> fun("(this is a test)", ")")
[1] "(this is a test<mark></mark>"

Expected:

> fun("this is a test.", ".")
[1] "this is a test<mark>.</mark>"

> fun("(this is a test)", ")")
[1] "(this is a test<mark>)</mark>"

What is the best way (a regular expression?) to search for words as well as punctuation symbols in a string?

CodePudding user response:

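You need to escape the regex metacharacters in the search string and replace \\b with lookarounds that require a word boundary only when the search term starts or ends with a word character. Lookarounds need perl=TRUE.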

See the R code:

## Escaping function: prefixes each regex metacharacter with a backslash
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
fun <- function(text, search) {
  gsub(paste0("(?!\\B\\w)(", regex.escape(search), ")(?<!\\w\\B)"), "<mark>\\1</mark>",
       text, ignore.case = TRUE, perl=TRUE)
}
fun("this is a test.", ".")
# [1] "this is a test<mark>.</mark>"

fun("(this is a test)", ")")
# [1] "(this is a test<mark>)</mark>"
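
To see how these boundaries behave with words as well as punctuation, here is a minimal sketch. It repeats the two functions so that it runs on its own. The character class in this copy of regex.escape lists `+` among the metacharacters, an assumption needed for the "c++" case. The test strings are made up for illustration.

```r
# Escape regex metacharacters so the search term is matched literally.
# Assumption: `+` belongs in the class alongside the other metacharacters.
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

# Wrap matches in <mark>. The lookarounds act as word boundaries only
# when the search term starts or ends with a word character.
fun <- function(text, search) {
  pattern <- paste0("(?!\\B\\w)(", regex.escape(search), ")(?<!\\w\\B)")
  gsub(pattern, "<mark>\\1</mark>", text, ignore.case = TRUE, perl = TRUE)
}

# Whole-word, case-insensitive: "Testing" is left alone
fun("Testing a TEST.", "test")
# [1] "Testing a <mark>TEST</mark>."

# Pure punctuation: no boundary is required on either side
fun("this is a test.", ".")
# [1] "this is a test<mark>.</mark>"

# Mixed term: boundary enforced before "C", not after "+"
fun("C++ and C", "c++")
# [1] "<mark>C++</mark> and C"
```

A plain \\b after the term would reject the "c++" case, since there is no word boundary between "+" and the following space.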