I would like to join two data sets that look like the following data sets. The matching rule would be that the Item
variable from mykey
matches the first part of the Item
entry in mydata
to some degree.
mydata <- tibble(Item = c("ab_kssv", "ab_kd", "cde_kh", "cde_ksa", "cde"),
Answer = c(1,2,3,4,5),
Avg = rep(-100, length(Item)))
mykey <- tibble(Item = c("ab", "cde"),
Avg = c(0 ,10))
The result should be the following:
Item Answer Avg
1 ab_kssv 1 0
2 ab_kd 2 0
3 cde_kh 3 10
4 cde_ksa 4 10
5 cde 5 10
I looked at these three SO questions, but did not find a nice solution there. I also briefly tried the fuzzyjoin package, but that did not work. Finally, I have a for-loop-based solution:
for (currLine in 1:nrow(mydata)) {
mydata$Avg[currLine] <- mykey$Avg[str_starts(mydata$Item[currLine], mykey$Item)]
}
It does the job, but is not nice to read / understand and I wonder if there is a possibility to make the "by" argument of full_join()
from the dplyr package a bit more tolerant with its matching. Any help will be apreciated!
CodePudding user response:
Using a fuzzyjoin::regex_left_join
you could do:
Note: I renamed the Item
column in your mykey
dataset to regex
to make clear that this is the regex to match by and added a "^"
to ensure that we match at the beginning of the Item
column in the mydata
dataset.
library(fuzzyjoin)
library(dplyr)
mykey <- mykey %>%
rename(regex = Item) %>%
mutate(regex = paste0("^", regex))
mydata %>%
select(-Avg) %>%
regex_left_join(mykey, by = c(Item = "regex")) %>%
select(-regex)
#> # A tibble: 5 × 3
#> Item Answer Avg
#> <chr> <dbl> <dbl>
#> 1 ab_kssv 1 0
#> 2 ab_kd 2 0
#> 3 cde_kh 3 10
#> 4 cde_ksa 4 10
#> 5 cde 5 10