grepl excluding a set of words before pattern

I would like to capture all mentions of "pensions" (case-insensitive, including pensions and pensioners, but excluding unrelated words like "suspension"). However, I would like to exclude "pensions" when it is preceded by "Department of Work and ", and I can't manage to capture the whole expression. So far I have:

sentences <- c("department of work and pensions", "and pensioners", "pensioners", "Pensions", "suspension")
try <- grepl("(?<!department of work and )^pension*", ignore.case = T, perl = T, sentences)
try

Any advice?

CodePudding user response：

grep('(?<!department of work and )\\bpension', sentences, 
        value = TRUE, ignore.case = TRUE, perl = TRUE)

[1] "and pensioners" "pensioners"     "Pensions"
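
To see why the attempt in the question behaves differently, a side-by-side comparison may help. This is only a sketch on the same sentences vector; the names original and fixed are illustrative and not part of either post.

```r
sentences <- c("department of work and pensions", "and pensioners",
               "pensioners", "Pensions", "suspension")

# Attempt from the question: "^" pins the match to the start of the string,
# so "pension" can never be found after other words, and "n*" only means
# "zero or more n", not "any suffix".
original <- grepl("(?<!department of work and )^pension*", sentences,
                  ignore.case = TRUE, perl = TRUE)

# Word-boundary version: "\\b" lets "pension" start wherever a word begins,
# and the lookbehind rejects it right after "department of work and ".
fixed <- grepl("(?<!department of work and )\\bpension", sentences,
               ignore.case = TRUE, perl = TRUE)

print(original)  # FALSE FALSE  TRUE  TRUE FALSE
print(fixed)     # FALSE  TRUE  TRUE  TRUE FALSE
```

The only difference in the results is "and pensioners", which the anchored pattern misses because the word does not start the string.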

CodePudding user response：

You can use a single pattern that skips the whole "department of work and ..." phrase and matches pension only at a word boundary (replace the literal spaces with \\s+ if the words may be separated by arbitrary whitespace):

sentences <- c("department of work and pensions", "and pensioners", "pensioners", "Pensions", "suspension")
grepl("\\bdepartment of work and \\w+(*SKIP)(*F)|\\bpension", ignore.case = TRUE, perl = TRUE, sentences)
## => [1] FALSE  TRUE  TRUE  TRUE FALSE


Details:

\bdepartment of work and \w+ - word boundary \b, then department of work and, a space, and one or more word chars
(*SKIP)(*F) - omit all text matched so far and start the next match search from the failure position
| - or
\bpension - word boundary \b and a pension substring.
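
The effect of (*SKIP)(*F) is easier to see on a sentence that contains both kinds of mention. The following is a minimal sketch: the sample text is made up, and \\w* is appended (an addition not in the answer) so that whole words are extracted.

```r
# Hypothetical sample sentence with one department mention and two others.
text <- "The Department of Work and Pensions wrote to pensioners about pension credit."

# Same idea as the pattern above, with \\w* added to capture the full word.
pattern <- "\\bdepartment of work and \\w+(*SKIP)(*F)|\\bpension\\w*"

# gregexpr() locates every match; regmatches() extracts the matched text.
hits <- regmatches(text,
                   gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE))[[1]]
print(hits)  # "pensioners" "pension"
```

The "Pensions" inside the department name is consumed by the first alternative and discarded, so only the two later mentions are returned.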

CodePudding user response：

We may use

# \\S+ requires at least one character after "pension"; use \\S* to also match the bare word
grepl("\\bpension\\S+", sentences, ignore.case = TRUE) & 
      !grepl("department of work .*\\bpension\\S+", sentences, ignore.case = TRUE)
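
One difference between this two-step filter and a single PCRE pattern shows up when a sentence mentions both the department and pensions on their own. The sketch below uses a made-up sentence and drops the \\S suffix for brevity; the names mixed, two_step and single are illustrative.

```r
# Hypothetical sentence containing the department name and a separate mention.
mixed <- "The Department of Work and Pensions wrote to pensioners."

# Two-step filter: the second grepl() rejects the whole sentence as soon as
# the department phrase is followed anywhere by a "pension" word.
two_step <- grepl("\\bpension", mixed, ignore.case = TRUE) &
  !grepl("department of work .*\\bpension", mixed, ignore.case = TRUE)

# Single PCRE pattern: only the department mention itself is skipped.
single <- grepl("\\bdepartment of work and \\w+(*SKIP)(*F)|\\bpension", mixed,
                ignore.case = TRUE, perl = TRUE)

print(c(two_step = two_step, single = single))  # FALSE TRUE
```

Which behaviour is preferable depends on whether a sentence naming the department should be dropped entirely or only have that one mention ignored.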