I have 2 data frames: one with a list of medications, the other with a different but highly overlapping list of medications along with corresponding medication ID codes. I want to merge these two data frames to apply the medication codes to the first data frame's medication list. I have a lot of partial string matches, and I want to detect strings in a case-insensitive manner.
library(tidyverse)
library(stringr)
label <- c("0.4% Lidocaine Hydrochloride", "10% Dextrose", "Act Raloxifene")
df1 <- as.DataFrame(label)
label2 <- c("LIDOCAINE", "RALOXIFENE", "JANUMET", "ESOMEPRAZOLE", "METFORMIN")
code <- c(0003, 0005, 0006, 0001, 0011)
df2 <- data.frame(label2, code)%>%
rename(label=label2)
I try to use str_detect from stringr package
merge_df <- merge(df1, df2,
by.x=c("label" = ifelse(str_detect(df1$label, regex(df2$label, ignore_case = T)),
df1$label, NA)),
by.y=c("label" = ifelse(str_detect(df1$label, regex(df2$label, ignore_case = T)),
df2$label, NA)),
ignore.case=T,all.x=T,all.y=T,
suffixes = c("_list", "_dict"),
nomatch=0)
And I get the error:
Error in str_detect()
:
! Can't recycle string
(size 3) to match pattern
(size 5).
CodePudding user response:
An approach using left_join
.
First add a variable l_lower in both sets containing all tolower
strings, separated by strsplit
to enable match of all entries.
After joining and arranging the y-labels remove duplicated entries and the helper column.
library(dplyr)
library(tidyr)
left_join(df1 %>%
rowwise() %>%
mutate(l_label = strsplit(tolower(label), " ")) %>%
unnest(l_label),
df2 %>%
rowwise() %>%
mutate(l_label = unlist(strsplit(tolower(label), " "))), "l_label") %>%
arrange(label.y) %>%
group_by(label.x) %>%
filter(!duplicated(label.x)) %>%
select(-l_label) %>%
ungroup()
# A tibble: 3 × 3
label.x label.y code
<chr> <chr> <dbl>
1 0.4% Lidocaine Hydrochloride LIDOCAINE 3
2 Act Raloxifene RALOXIFENE 5
3 10% Dextrose NA NA
Data
df1 <- structure(list(label = c("0.4% Lidocaine Hydrochloride", "10% Dextrose",
"Act Raloxifene")), class = "data.frame", row.names = c(NA, -3L
))
df2 <- structure(list(label = c("LIDOCAINE", "RALOXIFENE", "JANUMET",
"ESOMEPRAZOLE", "METFORMIN"), code = c(3, 5, 6, 1, 11)),
class = "data.frame", row.names = c(NA,
-5L))