My raw data contains a lot of personal information, so I am masking it in R. The sample data and my original code are below:
install.packages("stringr")
library(stringr)
x = c("010-1234-5678",
"John 010-8888-8888",
"Phone: 010-1111-2222",
"Peter 018.1111.3333",
"Year(2007,2019,2020)",
"Alice 01077776666")
df = data.frame(
phoneNumber = x
)
pattern1 = "\\d{3}-\\d{4}-\\d{4}"
pattern2 = "\\d{3}.\\d{4}.\\d{4}"
pattern3 = "\\d{11}"
delPhoneList1 <- str_match_all(df$phoneNumber, pattern1) %>% unlist
delPhoneList2 <- str_match_all(df$phoneNumber, pattern2) %>% unlist
delPhoneList3 <- str_match_all(df$phoneNumber, pattern3) %>% unlist
I found three types of patterns in the dataset, and the result for each is below:
> delPhoneList1
[1] "010-1234-5678" "010-8888-8888" "010-1111-2222"
> delPhoneList2
[1] "010-1234-5678" "010-8888-8888" "010-1111-2222" "018.1111.3333" "007,2019,2020"
> delPhoneList3
[1] "01077776666"
Pattern1 is the typical phone number format in my country, using dashes, but some people type the number with dots, as in pattern2. However, pattern2 also matches everything pattern1 matches, and it even picks up unrelated text such as a series of years. That is an unexpected result.
My question is how to match exactly the pattern that I define. Pattern2 matches too much, for example "007,2019,2020" from "Year(2007,2019,2020)".
Additionally, the next step is masking the number using the below code:
for (phone in delPhoneList1) {
df$phoneNumber <- gsub(phone, "010-9999-9999", df$phoneNumber)
}
I think this code works for me, but if you have a more efficient way, please let me know.
Thanks.
CodePudding user response:
One pattern to rule them all ;-)
ptn <- "\\b\\d{3}([-.]?)\\d{4}\\1\\d{4}\\b"
grepl(ptn, x)
# [1] TRUE TRUE TRUE TRUE FALSE TRUE
The reason your pattern2 failed is that it used . as a separator, but in regex that means "any character". You could have used \\. instead of . and it would have behaved better.
I'm using a capture group with a backreference here: if the first separator is a -, then \\1 ensures that the other separator is the same. If it's empty, then the second is empty as well. This also allows the 11 uninterrupted digits of pattern3.
The \\b are word boundaries, assuring us that 12 digits would not match:
grepl(ptn, c("12345678901", "123456789012"))
# [1] TRUE FALSE
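To make those two points concrete, here is a small base-R sketch (not part of the original answer) that reuses the sample vector x from the question. The names loose, strict and mixed are made up for illustration only.

```r
x <- c("010-1234-5678", "John 010-8888-8888", "Phone: 010-1111-2222",
       "Peter 018.1111.3333", "Year(2007,2019,2020)", "Alice 01077776666")

# Unescaped dot: "." matches any character, so the comma-separated years match too
loose <- grepl("\\d{3}.\\d{4}.\\d{4}", x)
# Escaped dot: only a literal "." is accepted as the separator
strict <- grepl("\\d{3}\\.\\d{4}\\.\\d{4}", x)

# The backreference \\1 forces the second separator to equal the first one
ptn <- "\\b\\d{3}([-.]?)\\d{4}\\1\\d{4}\\b"
mixed <- grepl(ptn, c("123-4444-5555", "123-4444.5555"))

print(loose)   # [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
print(strict)  # [1] FALSE FALSE FALSE  TRUE FALSE FALSE
print(mixed)   # [1]  TRUE FALSE
```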
Since this has a capture group, it tends to mess a little with stringr:: functions (str_match_all returns the captured separator as an extra column), but we can work around that, depending on what you need.
For instance, you can replace the backreference with a second instance of the same separator class. That also allows 123-4444.5555 (mixed separators), if that's not a problem.
ptn2 <- "\\b\\d{3}[-.]?\\d{4}[-.]?\\d{4}\\b"
unlist(str_match_all(x, ptn2))
# [1] "010-1234-5678" "010-8888-8888" "010-1111-2222" "018.1111.3333" "01077776666"
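or we can keep the original ptn and take only the first column of the str_match result, which holds the full match (this returns at most one match per string, with NA where nothing matches):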
unlist(str_match(x, ptn)[,1])
# [1] "010-1234-5678" "010-8888-8888" "010-1111-2222" "018.1111.3333" NA "01077776666"
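For the masking step from the question, one possible approach (a sketch, not something from the original answer) is to skip the intermediate list of matched numbers and pass ptn straight to gsub, so every format is replaced in one vectorised call. The base-R regmatches/gregexpr pair is shown as an alternative way to pull out the full matches without the capture group getting in the way. The name found is hypothetical.

```r
x <- c("010-1234-5678", "John 010-8888-8888", "Phone: 010-1111-2222",
       "Peter 018.1111.3333", "Year(2007,2019,2020)", "Alice 01077776666")
df <- data.frame(phoneNumber = x)

ptn <- "\\b\\d{3}([-.]?)\\d{4}\\1\\d{4}\\b"

# Extract the full matches only; strings without a match contribute nothing
found <- unlist(regmatches(x, gregexpr(ptn, x)))

# Mask every phone number in one call instead of looping over the matches
df$phoneNumber <- gsub(ptn, "010-9999-9999", df$phoneNumber)

print(found)
print(df$phoneNumber)
```

The years in "Year(2007,2019,2020)" are left untouched because a comma is not in the separator class [-.].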