Home > front end >  R code to merge 2 data frames by whether values in the first "by" variable contain string
R code to merge 2 data frames by whether values in the first "by" variable contain string

Time:01-19

I have 2 data frames: one with a list of medications, the other with a different but highly overlapping list of medications along with corresponding medication ID codes. I want to merge these two data frames to apply the medication codes to the first data frame's medication list. I have a lot of partial string matches, and I want to detect strings in a case-insensitive manner.

library(tidyverse)
library(stringr)
label <- c("0.4% Lidocaine Hydrochloride", "10% Dextrose", "Act Raloxifene")
df1 <- as.DataFrame(label)

label2 <- c("LIDOCAINE", "RALOXIFENE", "JANUMET", "ESOMEPRAZOLE", "METFORMIN")
code <- c(0003, 0005, 0006, 0001, 0011)

df2 <- data.frame(label2, code)%>%
  rename(label=label2)

I try to use str_detect from stringr package

merge_df <- merge(df1, df2,
                     by.x=c("label" = ifelse(str_detect(df1$label, regex(df2$label, ignore_case = T)),
                                             df1$label, NA)),
                     by.y=c("label" = ifelse(str_detect(df1$label, regex(df2$label, ignore_case = T)),
                                             df2$label, NA)),
                     ignore.case=T,all.x=T,all.y=T,
                     suffixes = c("_list", "_dict"),
                     nomatch=0)

And I get the error: Error in str_detect(): ! Can't recycle string (size 3) to match pattern (size 5).

CodePudding user response:

An approach using left_join.

First add a variable l_lower in both sets containing all tolower strings, separated by strsplit to enable match of all entries. After joining and arranging the y-labels remove duplicated entries and the helper column.

library(dplyr)
library(tidyr)

left_join(df1 %>% 
      rowwise() %>% 
      mutate(l_label = strsplit(tolower(label), " ")) %>% 
      unnest(l_label), 
    df2 %>% 
      rowwise() %>% 
      mutate(l_label = unlist(strsplit(tolower(label), " "))), "l_label") %>% 
  arrange(label.y) %>% 
  group_by(label.x) %>% 
  filter(!duplicated(label.x)) %>% 
  select(-l_label) %>% 
  ungroup()
# A tibble: 3 × 3
  label.x                      label.y     code
  <chr>                        <chr>      <dbl>
1 0.4% Lidocaine Hydrochloride LIDOCAINE      3
2 Act Raloxifene               RALOXIFENE     5
3 10% Dextrose                 NA            NA

Data

df1 <- structure(list(label = c("0.4% Lidocaine Hydrochloride", "10% Dextrose",
"Act Raloxifene")), class = "data.frame", row.names = c(NA, -3L
))

df2 <- structure(list(label = c("LIDOCAINE", "RALOXIFENE", "JANUMET",
"ESOMEPRAZOLE", "METFORMIN"), code = c(3, 5, 6, 1, 11)), 
class = "data.frame", row.names = c(NA,
-5L))
  • Related