Find if a string appears before another string

Time:09-21

I have a string variable containing patients' addresses. My goal is to flag patients who live at "401 30th street". I would like to flag strings that contain the number "401" before "30", to avoid flagging addresses like row 3. My code below only flags whether the string contains both 401 and 30, regardless of their positions. Any help would be appreciated.

          ADDRESS
1     401 30th st
2 40120 30 street
3       30 401 st
4     401 30th st
location <- structure(list(ADDRESS = c("401 30th st", "40120 30 street", 
"30 401 st", "401 30th st")), class = "data.frame", row.names = c(NA, 
-4L))
location <- location %>%
  mutate(ADDRESS = tolower(ADDRESS),
         st30 =  grepl("\\<401\\>", ADDRESS) & 
          grepl("\\<30\\>|\\<30th\\>|\\<30st\\>|\\<e30th\\>|\\<e30\\>", ADDRESS))

CodePudding user response:

Try with

library(dplyr)
library(stringr)
location %>%
    mutate(flag = str_detect(ADDRESS, '401\\b.*\\b30'))

CodePudding user response:

You may try this:

library(dplyr)
library(stringr)
location %>% 
    mutate(flag = str_detect(ADDRESS, '^[^30]*401 .*30.*$'))

output:

          ADDRESS  flag
1     401 30th st  TRUE
2 40120 30 street FALSE
3       30 401 st FALSE
4     401 30th st  TRUE
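
One caveat worth checking against your real data: `[^30]` is a character class that excludes the single characters 3 and 0, not the sequence "30". As far as I can tell, the pattern therefore also rejects any address that has a 3 or a 0 anywhere before 401. A minimal base R sketch (the two "apt" addresses are made up for illustration and do not come from the question):

```r
# Hypothetical addresses, not part of the original data
addr <- c("apt 5 401 30th st", "apt 3 401 30th st")

# Same pattern as in the answer above, run through base R's grepl()
flag <- grepl("^[^30]*401 .*30.*$", addr)
print(flag)
```

The first address is flagged, while the second is not, only because of the "3" in "apt 3".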

CodePudding user response:

When you use two separate grepl calls, the matches are searched for irrespective of the order of their appearance in the string.
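
To see this on the sample data, here is a small base R sketch with a shortened version of the question's alternation (only `30` and `30th` are kept, which is my simplification):

```r
addr <- c("401 30th st", "40120 30 street", "30 401 st", "401 30th st")

# Each call only answers "does this token occur somewhere in the string?"
has_401 <- grepl("\\<401\\>", addr)
has_30  <- grepl("\\<30\\>|\\<30th\\>", addr)

# Combining them with & carries no information about order
both_present <- has_401 & has_30
print(data.frame(addr, both_present))
```

Row 3 ("30 401 st") comes out flagged, because both tokens are present even though 30 precedes 401.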

Matching two substrings in order means:

  • Matching the leftmost pattern
  • Matching any characters in between (the regex engine must somehow get from the first pattern to the second) with a pattern like .*, .*?, [\s\S]*? or (?s:.)*? (the latter two work with PCRE and ICU)
  • Matching the rightmost pattern.
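
The three parts can be glued together mechanically. A rough sketch, assuming a helper I am calling `matches_in_order` (a hypothetical name, not a function from any package) that takes two regex fragments and a gap pattern:

```r
# Hypothetical helper: TRUE when `first` is followed, later in the string, by `second`.
# `first` and `second` are regex fragments; `gap` consumes whatever lies between them.
matches_in_order <- function(x, first, second, gap = ".*?") {
  grepl(paste0(first, gap, second), x, perl = TRUE)
}

in_order <- matches_in_order(c("401 30th st", "30 401 st"), "401", "30")
print(in_order)
```

This only encodes the order. The fragments are passed without boundaries here, so they still match inside longer numbers.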

So, here, as there are no line breaks in the input, you could probably use

location %>%
    mutate(st30 = grepl('401.*?30', ADDRESS))

However, the 401 and 30 patterns above match in any context, including inside a longer number such as 40120. If you want to match them as exact integer values, you need to use numeric boundaries:

grepl('(?<!\\d)401(?!\\d).*?(?<!\\d)30(?!\\d)', ADDRESS, perl=TRUE)

Probably, you can also get away with simple word boundaries at the start of these numeric patterns (i.e. no letter, digit or underscore is allowed immediately before them):

grepl('\\b401(?!\\d).*?\\b30(?!\\d)', ADDRESS, perl=TRUE)
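
For comparison, a short base R sketch that runs the unbounded pattern and both bounded patterns over the question's addresses plus two made-up ones ("1401 30th st" and "401 130 st" are my additions, chosen to exercise the boundaries):

```r
addr <- c("401 30th st", "40120 30 street", "30 401 st",
          "1401 30th st", "401 130 st")  # last two are hypothetical

# No boundaries: order is enforced, but 401 and 30 may sit inside longer numbers
loose  <- grepl("401.*?30", addr, perl = TRUE)

# Numeric boundaries: no digit directly before or after either number
strict <- grepl("(?<!\\d)401(?!\\d).*?(?<!\\d)30(?!\\d)", addr, perl = TRUE)

# Word boundary in front, "no digit" lookahead behind
word_b <- grepl("\\b401(?!\\d).*?\\b30(?!\\d)", addr, perl = TRUE)

print(data.frame(addr, loose, strict, word_b))
```

On these inputs only "401 30th st" passes the two bounded patterns, while the unbounded one also accepts "40120 30 street" and both made-up addresses. The two bounded variants would differ on something like "a401 30 st", where a letter directly precedes 401.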