Home > Software engineering >  R using regular expression to look up values in mutate
R using regular expression to look up values in mutate

Time:05-03

I have a reference table which has 2 columns, for example below:

pattern name
a.*b name 1
c\d{2} name 2

Let's say I have a dataframe with column col like below:

col
adb
c12
add

I want to use the pattern and based on the pattern to create another column based on col.

Using above example, the values for the new columns should be c("name 1", "name 2", NA) respectively. I tried to write a loop with string detect like below:

regex_map <- function(in_string){
  ref_table <- read_excel("./data/meta_data.xlsx", "mapping_ex") %>% filter(!is.na(pattern))
  
  for(i in 1:nrow(ref_table)){
    r <- ref_table[i,] 
    #print(str(r))
    if(str_detect(tolower(in_string), r$pattern)){
      return(r$name)
    }
  }
  
  return("N/A")
}

The function works fine, however, if I put the function as part of mutate, it is extremely slow, probably as expected. I am wondering how to efficiently do this in R? Thanks for your help!!

CodePudding user response:

The function is extremely slow because you are reading the ref_table every time you call it. Read the file only once outside the mutate and pass it on as a 2nd function regex_map argument.
You can further speed up the loop by having in_string all lower case just once outside the loop.
I am using base::grep, not stringr::str_detect.

y <- '
col
adb
c12
add'
df1 <- read.table(textConnection(y), header = TRUE)

suppressPackageStartupMessages({
  library(dplyr)
  library(readxl)
})

regex_map <- function(in_string, ref_table){
  res <- rep("N/A", length(in_string))
  in_string <- tolower(in_string)
  for(i in seq_len(nrow(ref_table))){
    r <- ref_table[i, , drop = FALSE] 
    found <- grep(r$pattern, in_string)
    if(length(found)){
      res[found] <- r$name
    }
  }
  res
}

ref_table_file <- file.path("~", "Temp", "meta_data.xlsx")
ref_table <- read_excel(ref_table_file, "mapping_ex") %>% filter(!is.na(pattern))

df1 %>%
  mutate(clean = regex_map(col, ref_table))
#>   col  clean
#> 1 adb name 1
#> 2 c12 name 2
#> 3 add    N/A

Created on 2022-05-02 by the reprex package (v2.0.1)

CodePudding user response:

Another possible solution, based on tidyverse:

library(tidyverse)

df1 <- data.frame(
  pattern = c("a.*b", "c\\d{2}"),
  name = c("name1", "name2")
)

df2 <- data.frame(
  col = c("adb", "c12", "add")
)

df2 %>% 
  rowid_to_column() %>% 
  full_join(df1 %>% rowid_to_column()) %>% 
  mutate(name = if_else(str_detect(col, pattern), name, NA_character_)) %>% 
  select(col, name)

#> Joining, by = "rowid"
#>   col  name
#> 1 adb name1
#> 2 c12 name2
#> 3 add  <NA>
  • Related