I have two data.tables. The first data.table, DT_1, contains strings and a matching type as follows:
library(data.table)
DT_1 <- data.table(Source_name = c("Apple", "Banana", "Orange", "Pear", "Random"),
                   Match_type = c("Anywhere", "Beginning", "Anywhere", "End", "End"))
For each "Source_name" string in DT_1, I then want to return the first column name of DT_2 (defined below) that matches it, using the matching type specified in DT_1. The matching should be case-insensitive.
DT_2 <- data.table(Pear_1 = 1, Eat_apple = 1, Split_Banana = 1,
                   Pear_2 = 1, Eat_pear = 1, Orange_peel = 1, Banana_chair = 1)
For example, the string "Apple" can be found anywhere within the column names of DT_2. The first instance of that is "Eat_apple".
For the next string, "Banana", it must be matched at the beginning of the column name strings. The first instance of this is "Banana_chair".
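As a quick check of the two examples above, here is a minimal sketch in base R. The cols vector simply repeats the column names of DT_2 so that the snippet runs on its own, and first_apple / first_banana are names made up for illustration.

```r
# Column names of DT_2, written out so the snippet is self-contained
cols <- c("Pear_1", "Eat_apple", "Split_Banana", "Pear_2",
          "Eat_pear", "Orange_peel", "Banana_chair")

# "Anywhere": the plain string is the pattern; [1] keeps the first hit
first_apple <- grep("Apple", cols, ignore.case = TRUE, value = TRUE)[1]

# "Beginning": the pattern is anchored with "^"
first_banana <- grep("^Banana", cols, ignore.case = TRUE, value = TRUE)[1]

first_apple   # "Eat_apple"
first_banana  # "Banana_chair"
```

"Split_Banana" is skipped for the second pattern because the anchor only allows a match at the start of the name.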
I have written some (really ugly) code to handle this as follows:
library(purrr)
DT_1[, Col_Name := names(DT_2)[unlist(pmap(
  .l = .SD[, .(x = Match_type, y = Source_name)],
  .f = function(x, y) {
    if (x == "Anywhere") {
      # returns the first match if there is a match anywhere
      grep(tolower(y), tolower(names(DT_2)))[1]
    } else if (x == "Beginning") {
      # returns the first match if the string is at the beginning (start anchor "^")
      grep(paste0("^", tolower(y), "."), tolower(names(DT_2)))[1]
    } else if (x == "End") {
      # returns the first match if the string is at the end (end anchor "$")
      grep(paste0(".", tolower(y), "$"), tolower(names(DT_2)))[1]
    }
  }))]]
I tried to use str_extract / str_detect from the stringr package to reproduce the output, but it didn't like the fact that the pattern and the column names of DT_2 had different lengths.
Can anyone provide any tips on how I can improve my code here? I'm not wedded to a particular approach.
Thanks in advance, Phil
CodePudding user response:
One way of doing this would be to prepare the regex first and then, for each Source_name, find the first corresponding match.
library(dplyr)
library(purrr)
library(stringr)
cols <- names(DT_2)
DT_1 %>%
  mutate(regex = case_when(Match_type == "Anywhere"  ~ Source_name,
                           Match_type == "Beginning" ~ str_c('^', Source_name),
                           Match_type == "End"       ~ str_c(Source_name, '$')),
         Col_Name = map_chr(regex,
                            ~ str_subset(cols, regex(.x, ignore_case = TRUE))[1]))
#   Source_name Match_type   regex     Col_Name
#1:       Apple   Anywhere   Apple    Eat_apple
#2:      Banana  Beginning ^Banana Banana_chair
#3:      Orange   Anywhere  Orange  Orange_peel
#4:        Pear        End   Pear$     Eat_pear
#5:      Random        End Random$         <NA>
Note that the [1] applied to the result of str_subset is useful in two scenarios:
- When there are multiple matches, it returns only the first one.
- When there is no match, it returns NA.
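The same [1] idiom also works with base grep, so the approach can be sketched with data.table alone. This is only a sketch under a few assumptions: fcase needs data.table 1.13.0 or later, the pattern column and the first_match helper are names made up for illustration, and the Source_name values are assumed to contain no regex metacharacters.

```r
library(data.table)

DT_1 <- data.table(
  Source_name = c("Apple", "Banana", "Orange", "Pear", "Random"),
  Match_type  = c("Anywhere", "Beginning", "Anywhere", "End", "End")
)
# Column names of DT_2, written out so the snippet is self-contained
cols <- c("Pear_1", "Eat_apple", "Split_Banana", "Pear_2",
          "Eat_pear", "Orange_peel", "Banana_chair")

# Indexing with [1] keeps the first element, or gives NA on an empty vector
c("Pear_1", "Pear_2")[1]  # "Pear_1"
character(0)[1]           # NA

# Build one regex per row from the match type
DT_1[, pattern := fcase(
  Match_type == "Anywhere",  Source_name,
  Match_type == "Beginning", paste0("^", Source_name),
  Match_type == "End",       paste0(Source_name, "$")
)]

# Hypothetical helper: first column name matching pattern p, ignoring case
first_match <- function(p) {
  grep(p, cols, ignore.case = TRUE, value = TRUE)[1]
}

DT_1[, Col_Name := vapply(pattern, first_match, character(1), USE.NAMES = FALSE)]
DT_1
```

With the sample data, Col_Name should come out the same as in the dplyr version, including NA for "Random".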