I want to identify strings which feature two keywords that have between 0 and 3 words between them. What I have works in most cases:
strings <- c(
"Today is my birthday",
"Today is not yet my birthday",
"Today birthday",
"Today maybe?",
"Today: birthday"
)
grepl("Today(\\s\\w+){0,3}\\sbirthday", strings, ignore.case = TRUE)
#> [1] TRUE FALSE TRUE FALSE FALSE
Created on 2021-11-24 by the reprex package (v2.0.1)
My issue is with the string "Today: birthday". The problem is that a word is defined as (\\s\\w+), which leaves no room for the sentence to contain any punctuation. How can I better define the regex for a word so that punctuation is not excluded (ideally it would simply be ignored)?
CodePudding user response:
You can use
> grepl("Today(\\W+\\w+){0,3}\\W+birthday", strings, ignore.case = TRUE)
[1] TRUE FALSE TRUE FALSE TRUE
Also, consider using word boundaries, non-capturing groups, and the more stable PCRE regex engine:
grepl("\\bToday(?:\\W+\\w+){0,3}\\W+birthday\\b", strings, ignore.case = TRUE, perl=TRUE)
The (?:\W+\w+){0,3}\W+ part matches zero to three occurrences of one or more non-word chars (\W+) followed by one or more word chars (\w+), and then one or more non-word chars. Because \W+ matches whitespace and punctuation alike, "Today: birthday" now matches.
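If the keywords or the maximum gap vary, the same pattern can be built from parameters. The following is only a sketch: `keywords_within` and its arguments are hypothetical names introduced for illustration, and it assumes the keywords contain no regex metacharacters.

```r
strings <- c(
  "Today is my birthday",
  "Today is not yet my birthday",
  "Today birthday",
  "Today maybe?",
  "Today: birthday"
)

# Hypothetical helper (not part of base R): TRUE where `first` is followed by
# `second` with at most `max_gap` words in between, ignoring punctuation.
# Assumes `first` and `second` contain no regex metacharacters.
keywords_within <- function(x, first, second, max_gap = 3) {
  pattern <- sprintf(
    "\\b%s(?:\\W+\\w+){0,%d}\\W+%s\\b",
    first, as.integer(max_gap), second
  )
  grepl(pattern, x, ignore.case = TRUE, perl = TRUE)
}

result <- keywords_within(strings, "Today", "birthday")
result
#> [1]  TRUE FALSE  TRUE FALSE  TRUE

# With max_gap = 0 only directly adjacent keywords match
# (punctuation between them is still allowed).
keywords_within(strings, "Today", "birthday", max_gap = 0)
#> [1] FALSE FALSE  TRUE FALSE  TRUE
```

Note that \w also matches digits and underscores, so a token such as "2021" counts as a word for the purposes of the gap.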