In a large dataset, I need to replace the territory names with their corresponding codes.
Below is a small reproducible example:
library(tidyverse)  # also attaches dplyr, tidyr, purrr and stringr
library(stringi)
adrs_data <- data.frame(adress = c("6 Frien Street, Paris", "Toulouse, 7 Hospital street", "10 market avenue (Bordeaux)"))
dep_code <- data.frame(code = c("75", "31", "33"), names = c("Paris", "Toulouse", "Bordeaux"))
This is what I have tried:
d_search <- dep_code$names
d_search <- paste(d_search[order(-nchar(d_search))], collapse = "|")
c_search <- dep_code$code
df <- adrs_data %>%
  dplyr::mutate(c_adress = case_when(
    adress %in% d_search ~ str_replace_all(adress, d_search, c_search),
    TRUE ~ adress
  ))
But it does not produce the desired output, which is:
df <- data.frame(adress = c("6 Frien Street, 75", "31, 7 Hospital street", "10 market avenue (33)"))
Thank you for your help. Best regards.
CodePudding user response:
After joining both data frames with fuzzyjoin::fuzzy_left_join, you can use pmap to replace the matched name in each row:
library(dplyr)
library(stringr)
library(purrr)
library(fuzzyjoin)
fuzzy_left_join(adrs_data, dep_code, match_fun = str_detect,
                by = c("adress" = "names")) %>%
  # after the join the columns are ..1 = adress, ..2 = code, ..3 = names
  mutate(adress = pmap_chr(., ~ str_replace(..1, ..3, ..2)))
# adress code names
# 1 6 Frien Street, 75 75 Paris
# 2 31, 7 Hospital street 31 Toulouse
# 3 10 market avenue (33) 33 Bordeaux
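As an aside, the original attempt fails because adress %in% d_search tests each full address for exact equality against the single alternation string "Paris|Toulouse|Bordeaux", so the replacement branch of case_when never fires. If you don't need the joined columns, the replacement can also be done in one call, since str_replace_all accepts a named vector of pattern-replacement pairs. A minimal sketch, assuming each address contains at most one of the known names:
adrs_data %>%
  # setNames() builds the lookup c(Paris = "75", Toulouse = "31", Bordeaux = "33")
  mutate(adress = str_replace_all(adress, setNames(dep_code$code, dep_code$names)))
To keep only the cleaned address from the join version above, add select(adress).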
CodePudding user response:
This also works and removes the parentheses. I added one more entry to adrs_data:
adrs_data <- data.frame(adress = c("6 Frien Street, Paris", "Toulouse, 7 Hospital street", "10 market avenue (Bordeaux)", "9 Test Street, Paris"))
dep_code <- data.frame(code = c("75", "31", "33"), names = c("Paris", "Toulouse", "Bordeaux"))
sapply(seq(nrow(adrs_data)), \(i) {
  adrs_data[i, ] %>%
    # vectorised over patterns: one result per name in dep_code
    str_replace_all(dep_code$names, dep_code$code) %>%
    # keep only the result where a replacement actually happened
    .[. != adrs_data[i, ]] %>%
    sub(pattern = "\\(", replacement = "", x = .) %>%
    sub(pattern = "\\)", replacement = "", x = .)
})
[,1]
[1,] "6 Frien Street, 75"
[2,] "31, 7 Hospital street"
[3,] "10 market avenue 33"
[4,] "9 Test Street, 75"
For very small data like this, the sapply approach is even faster than the fuzzy join:
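The exact benchmark call is not shown in the original post; the timings below were presumably produced with something like this microbenchmark sketch (the labels sapply and fuzzy are reconstructions):
library(microbenchmark)
microbenchmark(
  sapply = sapply(seq(nrow(adrs_data)), \(i) {
    adrs_data[i, ] %>%
      str_replace_all(dep_code$names, dep_code$code) %>%
      .[. != adrs_data[i, ]] %>%
      sub(pattern = "\\(", replacement = "", x = .) %>%
      sub(pattern = "\\)", replacement = "", x = .)
  }),
  fuzzy = fuzzy_left_join(adrs_data, dep_code, match_fun = str_detect,
                          by = c("adress" = "names")) %>%
    mutate(adress = pmap_chr(., ~ str_replace(..1, ..3, ..2))),
  times = 100
)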
Unit: milliseconds
   expr       min        lq      mean    median        uq      max neval
 sapply  1.508001  1.671950  2.481109  1.823151  2.534501  11.9309   100
  fuzzy 58.665500 67.577851 85.267257 76.948251 89.716950 231.3375   100
It would be interesting to compare with @Maël's solution on a big data set; I assume the larger the data, the slower the sapply solution becomes relative to the join.
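One way to check that (a sketch; the replication factor of 10,000 is arbitrary):
# scale the toy data up, then rerun the microbenchmark call above
# with big_adrs in place of adrs_data
big_adrs <- data.frame(adress = rep(adrs_data$adress, 10000))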