I would like to do exact joins for the columns state and name, but a fuzzy join for the "name" and "versus" columns:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
df1 <- data.frame(year, state, name)
year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("@ george v. SALLY", "@laura v. dAvid")
df2 <- data.frame(year, state, versus)
My preferred output would be the following:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
versus <- c("@ george v. SALLY", "@ george v. SALLY", "@laura v. dAvid", "@laura v. dAvid", NA, NA)
df3 <- data.frame(year, state, name, versus)
I've tried variations of the following:
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = c("year", "state", "name" = "versus"), method = "hamming")
stringdist_left_join(df1, df2, by = c("year", "state"), method = "hamming")
And they don't seem to get close to what I want.
I'm wondering if I'll need to spit up the "versus" column (remove all special characters and delimit the names) or if there's a way for me to accomplish this with something within fuzzyjoin. Any guidance would be appreciated.
CodePudding user response:
A simple approach, which depends somewhat on the structure of df2$versus
, would be this:
library(dplyr)
left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus:=if_else(grepl(name,versus,ignore.case=T), versus,as.character(NA)))
Output:
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN George @ george v. SALLY
2 2002 TN Sally @ george v. SALLY
3 1999 AL David @laura v. dAvid
4 1999 AL Laura @laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
CodePudding user response:
An approach could be:
library(tidyverse)
df1 |>
left_join(df2) |>
group_by(state) |>
mutate(versus = if_else(str_detect(tolower(versus), tolower(name)), versus, NA_character_)) |>
ungroup()
Output:
# A tibble: 6 × 4
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN George @ george v. SALLY
2 2002 TN Sally @ george v. SALLY
3 1999 AL David @laura v. dAvid
4 1999 AL Laura @laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Update: Almost the same approach as @langtang within seconds. keeping this answer for inspiration.