Matching a character vector with multiple patterns in R

Time:01-16

I have two data.tables. The first data.table, DT_1, contains strings and a matching type as follows:

library(data.table)  
DT_1 <- data.table(Source_name = c("Apple","Banana","Orange","Pear","Random"),
                   Match_type = c("Anywhere","Beginning","Anywhere","End","End"))

I then want to return the first match of the "Source_name" string from DT_1 to the column names of DT_2 (as below) using the matching type specified in DT_1. The matches are to be carried out without case sensitivity.

DT_2 <- data.table(Pear_1 = 1,Eat_apple = 1,Split_Banana = 1,
                   Pear_2 = 1,Eat_pear = 1,Orange_peel = 1,Banana_chair = 1)

For example, the string "Apple" can be found anywhere within the column names of DT_2. The first instance of that is "Eat_apple".

For the next string, "Banana", it must be matched at the beginning of the column name strings. The first instance of this is "Banana_chair".

I have written some (really ugly) code to handle this as follows:

library(purrr)      
DT_1[,Col_Name := names(DT_2)[unlist(pmap(.l = .SD[,.(x = Match_type,y = Source_name)],
              .f = function(x,y){
                  if(x == "Anywhere"){
                       grep(tolower(y),tolower(names(DT_2)))[1] # returns the first match if there is a match anywhere
                  }else if (x == "Beginning"){
                       grep(paste0("^",tolower(y),"."),tolower(names(DT_2)))[1] # returns the first match if the string is at the beginning (denoted by the anchor "^")
                  }else if (x == "End"){
                        grep(paste0(".",tolower(y),"$"),tolower(names(DT_2)))[1] # returns the first match if the string is at the end (denoted by the end anchor "$")
                  }}))]]

I tried to use str_extract / str_detect from the stringr package to reproduce the output, but they failed because the pattern vector and the column names of DT_2 have different lengths.

Can anyone provide any tips on how I can improve my code here? I'm not wedded to a particular approach.

Thanks in advance, Phil

Answer:

One way to do this is to build the regex for each row first, and then, for each Source_name, find the first column name that matches it.

library(dplyr)
library(purrr)
library(stringr)

cols <- names(DT_2)

DT_1 %>%
  mutate(regex = case_when(Match_type == "Anywhere" ~ Source_name, 
                           Match_type == "Beginning" ~ str_c('^',Source_name), 
                           Match_type == "End" ~str_c(Source_name, '$')), 
         Col_Name = map_chr(regex, 
                    ~str_subset(cols, regex(.x, ignore_case = TRUE))[1]))

#   Source_name Match_type   regex     Col_Name
#1:       Apple   Anywhere   Apple    Eat_apple
#2:      Banana  Beginning ^Banana Banana_chair
#3:      Orange   Anywhere  Orange  Orange_peel
#4:        Pear        End   Pear$     Eat_pear
#5:      Random        End Random$         <NA>
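
If you would rather stay within data.table and base R, the same two steps can be sketched roughly as below. This is only one possible variant, not the code from the answer above: the column name pattern is my own choice, and fcase() assumes data.table 1.13.0 or later.

```r
library(data.table)

DT_1 <- data.table(Source_name = c("Apple","Banana","Orange","Pear","Random"),
                   Match_type = c("Anywhere","Beginning","Anywhere","End","End"))
DT_2 <- data.table(Pear_1 = 1, Eat_apple = 1, Split_Banana = 1,
                   Pear_2 = 1, Eat_pear = 1, Orange_peel = 1, Banana_chair = 1)

cols <- names(DT_2)

# Step 1: build one regex per row, anchored at the start, at the end, or not at all.
# "pattern" is a hypothetical column name chosen for this sketch.
DT_1[, pattern := fcase(Match_type == "Beginning", paste0("^", Source_name),
                        Match_type == "End",       paste0(Source_name, "$"),
                        Match_type == "Anywhere",  Source_name)]

# Step 2: grep(value = TRUE) returns the matching column names;
# [1] keeps the first one, or yields NA when nothing matches.
DT_1[, Col_Name := vapply(pattern,
                          function(p) grep(p, cols, ignore.case = TRUE, value = TRUE)[1],
                          character(1), USE.NAMES = FALSE)]

print(DT_1)
```

As in the stringr version, Source_name is used as a regex as-is, so names containing special characters such as "." or "(" would need escaping first.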

Note that the [1] applied to the result of str_subset is useful in two scenarios:

  1. When there are multiple matches, it returns only the first match.
  2. When there is no match, it returns NA.
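
Both scenarios can be checked in isolation with a small sketch like the one below. The cols vector here is a shortened, hypothetical subset of the column names, and the variable names are only for illustration.

```r
library(stringr)

# Shortened, hypothetical subset of names(DT_2)
cols <- c("Pear_1", "Eat_apple", "Split_Banana", "Pear_2")

# Scenario 1: several matches. str_subset() returns all of them,
# and [1] keeps only the first.
all_matches <- str_subset(cols, regex("pear", ignore_case = TRUE))
first_match <- all_matches[1]

# Scenario 2: no match. str_subset() returns character(0),
# and indexing an empty vector with [1] gives NA.
no_matches <- str_subset(cols, regex("random", ignore_case = TRUE))
no_match   <- no_matches[1]

print(all_matches)
print(first_match)
print(length(no_matches))
print(no_match)
```

This is also why map_chr works in the answer: every element comes back as a length-one character value, either a column name or NA.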