I am masking phone numbers and personal names in my raw data. I already asked about the phone numbers here and got an answer.
To mask the personal names, I have the following code:
x = c("010-1234-5678",
      "John 010-8888-8888",
      "Phone: 010-1111-2222",
      "Peter 018.1111.3333",
      "Year(2007,2019,2020)",
      "Alice 01077776666")
df = data.frame(
  phoneNumber = x
)
delName = c("John", "Peter", "Alice")
for (name in delName) {
  df$phoneNumber <- gsub(name, "anonymous", df$phoneNumber)
}
That code works as intended,
> df
phoneNumber
1 010-1234-5678
2 anonymous 010-8888-8888
3 Phone: 010-1111-2222
4 anonymous 018.1111.3333
5 Year(2007,2019,2020)
6 anonymous 01077776666
but I have over 10,000 personal names to mask. R is currently on the 789th name. It will finish eventually, but I would like to know how to reduce the processing time. I looked into foreach, but I do not know how to adapt my original code above.
CodePudding user response:
You could try this without a loop first: paste the names together into a single pattern, separated by an or (|).
(delNamec <- paste(delName, collapse='|'))
# [1] "John|Peter|Alice"
gsub(delNamec, 'anonymous', df$phoneNumber)
# [1] "010-1234-5678"
# [2] "anonymous 010-8888-8888"
# [3] "Phone: 010-1111-2222"
# [4] "anonymous 018.1111.3333"
# [5] "Year(2007,2019,2020)"
# [6] "anonymous 01077776666"
Runs in a blink of an eye, even with 100k rows.
df2 <- df[sample(nrow(df), 1e5, replace = TRUE), , drop = FALSE]
dim(df2)
# [1] 100000 1
system.time(gsub(delNamec, 'anonymous', df2$phoneNumber))
# user system elapsed
# 0.129 0.000 0.129
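Two caveats with the combined pattern: each name is interpreted as a regular expression, and a name also matches inside a longer word (John inside Johnson). The following is only a minimal sketch, assuming the names should be matched literally and as whole words. The helper mask_names and its chunk_size argument are hypothetical names introduced here, not part of the question; splitting into chunks is just a precaution in case a single alternation of 10,000+ names turns out to be too large for the regex engine.

```r
# Hypothetical helper: mask whole-word, literal names in batches.
mask_names <- function(text, names, replacement = "anonymous",
                       chunk_size = 1000) {
  # Split the names into groups of at most chunk_size (assumed tuning knob)
  chunks <- split(names, ceiling(seq_along(names) / chunk_size))
  for (chunk in chunks) {
    # \Q...\E quotes each name literally; \b anchors on word boundaries
    alternatives <- paste0("\\Q", chunk, "\\E", collapse = "|")
    pattern <- paste0("\\b(?:", alternatives, ")\\b")
    text <- gsub(pattern, replacement, text, perl = TRUE)
  }
  text
}

x <- c("010-1234-5678",
       "John 010-8888-8888",
       "Johnson 010-9999-9999",
       "Alice 01077776666")
delName <- c("John", "Peter", "Alice")

# "Johnson" is left alone because of the word boundaries
masked <- mask_names(x, delName)
print(masked)
```

This still makes one pass over the data per chunk rather than one per name, so 10,000 names with the default chunk_size would need about ten gsub() calls instead of 10,000.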
CodePudding user response:
Here is another option using stringr.
library(stringr)
str_replace_all(
  string = df$phoneNumber,
  pattern = paste(delName, collapse = '|'),
  replacement = "anonymous"
)