Negative lookahead checking for absence of multiple patterns

Time:09-21

I am trying to flag which rows should and shouldn't be kept, based on whether the regex in the pattern column matches the value in the description column of the data below.

data <- data.frame(id = c(1,2,2,3,3,4), 
                   old_levels = c(0,1,1,1,1,2),
                   levels = c(1,2,3,2,3,4),
                   description = c("vegetable", "fruit", "fruit",
                                   "meat", "meat", "soda"),
                   pattern = c("vegetable",
                               "fruit", 
                               "?!(vegetable|fruit)", 
                               "fruit",
                               "?!(vegetable|fruit)", 
                               NA))

Using dplyr I figured that the below example should work:

data %>% rowwise() %>% mutate(matches = grepl(pattern, description))

However, this yields:

# A tibble: 6 x 6
# Rowwise: 
     id old_levels levels description pattern             matches
  <dbl>      <dbl>  <dbl> <chr>       <chr>               <lgl>  
1     1          0      1 vegetable   vegetable           TRUE   
2     2          1      2 fruit       fruit               TRUE   
3     2          1      3 fruit       ?!(vegetable|fruit) FALSE  
4     3          1      2 meat        fruit               FALSE  
5     3          1      3 meat        ?!(vegetable|fruit) FALSE  
6     4          2      4 soda        NA                  NA        

The NA is expected and works as intended, but I'm struggling to get the negative lookahead to work: matches in row 5 should be TRUE.

Any help would be appreciated!

Answer:

The lookahead syntax is (?!...), not ?!(...).

Also, grepl uses the TRE regex engine by default, which does not support lookarounds. Pass perl=TRUE to switch to PCRE, which does.
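Fixing the syntax and the engine is not quite enough on its own, which is why the pattern used further down also anchors the lookahead with ^ and .*. A minimal base R sketch of the difference, using the description values from the question (the variable names are only illustrative):

```r
# Illustrative check of negative lookahead behaviour with PCRE (perl = TRUE).
descriptions <- c("vegetable", "fruit", "meat")

# Unanchored: the lookahead fails at the first position of "fruit",
# but the regex engine then retries at the next position ("ruit"),
# where it succeeds. Every string therefore matches.
unanchored <- grepl("(?!vegetable|fruit)", descriptions, perl = TRUE)

# Anchored: the check runs only at the start of the string, and .* lets
# the lookahead scan the whole string for either word.
anchored <- grepl("^(?!.*(?:vegetable|fruit))", descriptions, perl = TRUE)

print(unanchored)  # TRUE TRUE TRUE
print(anchored)    # FALSE FALSE TRUE
```

In other words, a bare (?!...) asks "is there any position not followed by these words", while the anchored form asks "does the string contain none of these words".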

You can use:

data <- data.frame(id = c(1,2,2,3,3,4), 
                   old_levels = c(0,1,1,1,1,2),
                   levels = c(1,2,3,2,3,4),
                   description = c("vegetable", "fruit", "fruit",
                                   "meat", "meat", "soda"),
                   pattern = c("vegetable",
                               "fruit", 
                               "^(?!.*(?:vegetable|fruit))", 
                               "fruit",
                               "^(?!.*(?:vegetable|fruit))", 
                               NA))

library(dplyr)

data %>% rowwise() %>% mutate(matches = grepl(pattern, description, perl=TRUE))

Output:

> data %>% rowwise() %>% mutate(matches = grepl(pattern, description, perl=TRUE))
# A tibble: 6 x 6
# Rowwise: 
     id old_levels levels description pattern                    matches
  <dbl>      <dbl>  <dbl> <chr>       <chr>                      <lgl>  
1     1          0      1 vegetable   vegetable                  TRUE   
2     2          1      2 fruit       fruit                      TRUE   
3     2          1      3 fruit       ^(?!.*(?:vegetable|fruit)) FALSE  
4     3          1      2 meat        fruit                      FALSE  
5     3          1      3 meat        ^(?!.*(?:vegetable|fruit)) TRUE   
6     4          2      4 soda        <NA>                       NA     
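If you would rather not depend on rowwise(), the same per-row matching can probably be done in base R with mapply. This is only a sketch of an alternative, not part of the answer above. It rebuilds just the two relevant columns from the question, and stringsAsFactors = FALSE is added as a precaution for R versions before 4.0:

```r
# Per-row regex matching without dplyr: pair each pattern with its description.
data <- data.frame(
  description = c("vegetable", "fruit", "fruit", "meat", "meat", "soda"),
  pattern = c("vegetable",
              "fruit",
              "^(?!.*(?:vegetable|fruit))",
              "fruit",
              "^(?!.*(?:vegetable|fruit))",
              NA),
  stringsAsFactors = FALSE
)

# mapply walks both columns in parallel; perl = TRUE enables lookarounds.
# An NA pattern yields NA, as in the dplyr version.
data$matches <- mapply(
  function(p, d) grepl(p, d, perl = TRUE),
  data$pattern, data$description,
  USE.NAMES = FALSE
)

print(data$matches)  # TRUE TRUE FALSE FALSE TRUE NA
```

The result should agree with the matches column in the tibble above.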