I have a string variable containing patients' addresses. My goal is to flag patients who live at "401 30th street". I would like to flag strings that contain the number "401" before "30", to avoid flagging addresses like the one in row 3. My code below only flags whether the string contains the numbers 401 and 30, regardless of their positions. Any help would be appreciated.
ADDRESS
1 401 30th st
2 40120 30 street
3 30 401 st
4 401 30th st
location <- structure(list(ADDRESS = c("401 30th st", "40120 30 street",
"30 401 st", "401 30th st")), class = "data.frame", row.names = c(NA,
-4L))
library(dplyr)
location <- location %>%
mutate(ADDRESS = tolower(ADDRESS),
st30 = grepl("\\<401\\>", ADDRESS) &
grepl("\\<30\\>|\\<30th\\>|\\<30st\\>|\\<e30th\\>|\\<e30\\>", ADDRESS))
CodePudding user response:
Try with
library(dplyr)
library(stringr)
location %>%
mutate(flag = str_detect(ADDRESS, '401\\b.*\\b30'))
CodePudding user response:
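You may try this (note that [^30] is a character class that excludes the single characters 3 and 0, not the sequence 30):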
library(dplyr)
library(stringr)
location %>%
mutate(flag = str_detect(ADDRESS, '^[^30]*401 .*30.*$'))
output:
ADDRESS flag
1 401 30th st TRUE
2 40120 30 street FALSE
3 30 401 st FALSE
4 401 30th st TRUE
CodePudding user response:
When you use two separate grepl calls, the matches are searched for irrespective of the order of their appearance in the string.
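To see why the two-call approach flags row 3, here is a minimal sketch on the sample data from the question. The names has_401, has_30 and both are only for illustration, and the patterns are simplified versions of the ones in the question:

```r
# Sample data from the question
location <- data.frame(
  ADDRESS = c("401 30th st", "40120 30 street", "30 401 st", "401 30th st")
)

# Each call only answers "does this token occur anywhere in the string?"
has_401 <- grepl("\\b401\\b", location$ADDRESS, perl = TRUE)
has_30  <- grepl("\\b30(th)?\\b", location$ADDRESS, perl = TRUE)

# Combining them with & keeps no information about which token came first,
# so "30 401 st" (row 3) is flagged just like "401 30th st"
both <- has_401 & has_30
print(both)
```

Row 3 comes out TRUE here because both tokens are present, even though 30 appears before 401.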
Matching two substrings in order means
- Matching the leftmost pattern
- Matching any chars (because the regex engine must somehow get to the second pattern) with a pattern like .*, .*?, [\s\S]*?, (?s:.)*? (the latter two are PCRE/ICU compliant), etc.
- Matching the rightmost pattern.
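The three steps above can be sketched by assembling a single pattern from its parts. The variable names left, gap and right are hypothetical, chosen only to mirror the list:

```r
left  <- "401"   # leftmost pattern
gap   <- ".*?"   # any chars (lazily) between the two patterns
right <- "30"    # rightmost pattern

# One regex that requires left, then anything, then right, in that order
pattern <- paste0(left, gap, right)

addresses <- c("401 30th st", "30 401 st")
in_order <- grepl(pattern, addresses, perl = TRUE)
print(in_order)
```

Because the order is now part of the pattern itself, the second address is no longer matched: nothing that looks like 30 follows its 401.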
So, here, as there are no line breaks in the input, you could probably use
df %>%
mutate(st30 = grepl('401.*?30', ADDRESS))
However, the 401 and 30 patterns above match in any context, e.g. the 401 inside 40120. If you want to match them as exact integer values, you need to use numeric boundaries:
grepl('(?<!\\d)401(?!\\d).*?(?<!\\d)30(?!\\d)', ADDRESS, perl=TRUE)
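As a quick check of the difference, this sketch runs the unbounded pattern and the numeric-boundary pattern on the sample data from the question (the names loose and strict are just for illustration):

```r
location <- data.frame(
  ADDRESS = c("401 30th st", "40120 30 street", "30 401 st", "401 30th st")
)

# Unbounded: "401" is also found inside "40120", so row 2 is flagged
loose <- grepl("401.*?30", location$ADDRESS, perl = TRUE)

# Numeric boundaries: no digit may touch either number on either side
strict <- grepl("(?<!\\d)401(?!\\d).*?(?<!\\d)30(?!\\d)",
                location$ADDRESS, perl = TRUE)

print(data.frame(ADDRESS = location$ADDRESS, loose, strict))
```

Only the strict version rejects row 2, since the 401 in 40120 is followed by another digit.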
Probably, you can also get away with simple word boundaries at the start of these numeric patterns (i.e. no letter, digit or underscore is allowed immediately before them):
grepl('\\b401(?!\\d).*?\\b30(?!\\d)', ADDRESS, perl=TRUE)
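The two variants are not interchangeable when a letter sits directly in front of the number. The sketch below assumes an address such as "401 e30th st" (a made-up example, modelled on the e30th alternative in the question's own regex):

```r
addresses <- c("401 30th st", "401 e30th st")

# Digit lookarounds: a letter before "30" is fine, only digits are rejected
by_digits <- grepl("(?<!\\d)401(?!\\d).*?(?<!\\d)30(?!\\d)",
                   addresses, perl = TRUE)

# Word boundary: there is no \b between "e" and "3", so "e30th" is not matched
by_word <- grepl("\\b401(?!\\d).*?\\b30(?!\\d)", addresses, perl = TRUE)

print(data.frame(addresses, by_digits, by_word))
```

If forms like e30th should be flagged, the lookaround version is likely the safer choice. If they should be excluded, the word-boundary version does that.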