Find two keywords if they are between 0 and 3 words apart

Time:11-24

I want to identify strings which feature two keywords that have between 0 and 3 words between them. What I have works in most cases:

strings <- c(
  "Today is my birthday",
  "Today is not yet my birthday",
  "Today birthday",
  "Today maybe?",
  "Today: birthday"
)


grepl("Today(\\s\\w+){0,3}\\sbirthday", strings, ignore.case = TRUE)
#> [1]  TRUE FALSE  TRUE FALSE FALSE

Created on 2021-11-24 by the reprex package (v2.0.1)

My issue is with the string "Today: birthday". The problem is that a word is defined as (\\s\\w+), which leaves no room for punctuation in the sentence. How can I define a word in the regex so that punctuation is not excluded (ideally, it would simply be ignored)?

CodePudding user response:

You can use

> grepl("Today(\\W+\\w+){0,3}\\W+birthday", strings, ignore.case = TRUE)
[1]  TRUE FALSE  TRUE FALSE  TRUE

Also, consider using word boundaries, non-capturing groups, and the more stable PCRE regex engine:

grepl("\\bToday(?:\\W+\\w+){0,3}\\W+birthday\\b", strings, ignore.case = TRUE, perl = TRUE)

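The (?:\W+\w+){0,3}\W+ part matches zero to three occurrences of one or more non-word chars (\W+) followed by one or more word chars (\w+), and then one or more non-word chars before the second keyword. Because \W+ matches any run of non-word characters, whitespace and punctuation such as ": " are both accepted as separators.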
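
To reuse the pattern above for other keywords or gap sizes, it can be wrapped in a small function. This is only a sketch: within_n_words and its arguments first, second, and max_gap are hypothetical names, not part of base R, and it assumes both keywords are plain words with no regex metacharacters.

```r
# Hypothetical helper (not from any package): builds the answer's pattern
# for two keywords and a maximum number of words allowed between them.
# Assumes `first` and `second` contain no regex metacharacters.
within_n_words <- function(x, first, second, max_gap = 3) {
  pattern <- sprintf(
    "\\b%s(?:\\W+\\w+){0,%d}\\W+%s\\b",
    first, as.integer(max_gap), second
  )
  grepl(pattern, x, ignore.case = TRUE, perl = TRUE)
}

strings <- c(
  "Today is my birthday",
  "Today is not yet my birthday",
  "Today birthday",
  "Today maybe?",
  "Today: birthday"
)

result <- within_n_words(strings, "Today", "birthday")
print(result)
#> [1]  TRUE FALSE  TRUE FALSE  TRUE

# Allowing up to 4 words in between also matches the second string.
wider <- within_n_words(strings, "Today", "birthday", max_gap = 4)

# The word boundaries reject keywords embedded in longer words.
embedded <- within_n_words("Sometoday is my birthdays", "Today", "birthday")
```

Here wider should be TRUE for every string except "Today maybe?", and embedded should be FALSE, since neither keyword stands as a whole word in that string.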
