I have a reference table which has 2 columns, for example below:
pattern | name |
---|---|
a.*b | name 1 |
c\d{2} | name 2 |
Let's say I have a dataframe with column col like below:
col |
---|
adb |
c12 |
add |
I want to use the pattern and based on the pattern to create another column based on col.
Using above example, the values for the new columns should be c("name 1", "name 2", NA) respectively. I tried to write a loop with string detect like below:
regex_map <- function(in_string){
ref_table <- read_excel("./data/meta_data.xlsx", "mapping_ex") %>% filter(!is.na(pattern))
for(i in 1:nrow(ref_table)){
r <- ref_table[i,]
#print(str(r))
if(str_detect(tolower(in_string), r$pattern)){
return(r$name)
}
}
return("N/A")
}
The function works fine, however, if I put the function as part of mutate, it is extremely slow, probably as expected. I am wondering how to efficiently do this in R? Thanks for your help!!
CodePudding user response:
The function is extremely slow because you are reading the ref_table
every time you call it. Read the file only once outside the mutate
and pass it on as a 2nd function regex_map
argument.
You can further speed up the loop by having in_string
all lower case just once outside the loop.
I am using base::grep
, not stringr::str_detect
.
y <- '
col
adb
c12
add'
df1 <- read.table(textConnection(y), header = TRUE)
suppressPackageStartupMessages({
library(dplyr)
library(readxl)
})
regex_map <- function(in_string, ref_table){
res <- rep("N/A", length(in_string))
in_string <- tolower(in_string)
for(i in seq_len(nrow(ref_table))){
r <- ref_table[i, , drop = FALSE]
found <- grep(r$pattern, in_string)
if(length(found)){
res[found] <- r$name
}
}
res
}
ref_table_file <- file.path("~", "Temp", "meta_data.xlsx")
ref_table <- read_excel(ref_table_file, "mapping_ex") %>% filter(!is.na(pattern))
df1 %>%
mutate(clean = regex_map(col, ref_table))
#> col clean
#> 1 adb name 1
#> 2 c12 name 2
#> 3 add N/A
Created on 2022-05-02 by the reprex package (v2.0.1)
CodePudding user response:
Another possible solution, based on tidyverse
:
library(tidyverse)
df1 <- data.frame(
pattern = c("a.*b", "c\\d{2}"),
name = c("name1", "name2")
)
df2 <- data.frame(
col = c("adb", "c12", "add")
)
df2 %>%
rowid_to_column() %>%
full_join(df1 %>% rowid_to_column()) %>%
mutate(name = if_else(str_detect(col, pattern), name, NA_character_)) %>%
select(col, name)
#> Joining, by = "rowid"
#> col name
#> 1 adb name1
#> 2 c12 name2
#> 3 add <NA>