How to merge based on a subset of string in a column?-CodePudding

This is an extension of my previous post.

I have the following dataframes (df1 and df2) that I'm trying to merge:

year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("Molly Homes, Jane Doe", "Sally", "David", "Laura", "John", "Kate")

df1 <- data.frame(year, state, name)

year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("Homes (v. Vista)", "@laura v. dAvid")

df2 <- data.frame(year, state, versus)

And I df4 is my ideal output:

year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("Molly Homes, Jane Doe", "Sally", "David", "Laura", "John", "Kate")
versus <- c("Homes (v. Vista)", "@ george v. SALLY", "@laura v. dAvid", "@laura v. dAvid", NA, NA)

df4 <- data.frame(year, state, name, versus)

The kind responders on the last post suggested this (and a variation):

library(dplyr)

df3 <- left_join(df1,df2, by=c("year","state")) %>% 
  rowwise() %>% 
  mutate(versus:=if_else(grepl(name,versus,ignore.case=T), versus,as.character(NA)))

The problem with the above code is that it doesn't match subsets. Ideally, I'd like grepl(x, y) to match each other, vice versa. If x is in y and/or y is in x, then it's TRUE and results in the value in the "versus" column.

CodePudding user response：

fuzzyjoin is meant for regex searches like this :-)

library(dplyr)
# library(tidyr)     # unnest
# library(fuzzyjoin) # fuzzy_*_join
df1 %>%
  mutate(
    rn = row_number(),
    ptn = strsplit(name, "[ ,] ")
  ) %>%
  tidyr::unnest(ptn) %>%
  fuzzyjoin::fuzzy_left_join(df2,
    by = c("year" = "year", "state" = "state", "ptn" = "versus"),
    match_fun = list(`==`, `==`, function(...) Vectorize(grepl)(..., ignore.case = TRUE))
  ) %>%
  group_by(rn, year = year.x, state = state.x, name) %>%
  summarize(versus = na.omit(versus)[1], .groups = "drop") %>%
  select(-rn)
# # A tibble: 6 x 4
#   year  state name                  versus          
#   <chr> <chr> <chr>                 <chr>           
# 1 2002  TN    Molly Homes, Jane Doe Homes (v. Vista)
# 2 2002  TN    Sally                 NA              
# 3 1999  AL    David                 @laura v. dAvid 
# 4 1999  AL    Laura                 @laura v. dAvid 
# 5 1997  CA    John                  NA              
# 6 2002  TN    Kate                  NA

CodePudding user response：

We need a way to retrieve the series of whole words, and check if any of them appear (case-insensitive) within the versus column. Here is one simple way to do this:

Create function (f(n,v)), which takes strings n and v, extracts the whole words (wrds) from n, and then counts how many of them are found in v. Returns TRUE if this count exceeds 0

f <- function(n,v) {
  wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
  sum(sapply(wrds[which(nchar(wrds)>1)], grepl,x=v,ignore.case=T))>0
}

Left join the original frames, and apply f() by row, retaining versus if one or more whole words from name are found in veruss, else set to NA

left_join(df1,df2, by=c("year","state")) %>% 
  rowwise() %>% 
  mutate(versus:=if_else(f(name, versus), versus,NA_character_))

Output:

1 2002  TN    Molly Homes, Jane Doe Homes (v. Vista)
2 2002  TN    Sally                 NA              
3 1999  AL    David                 @laura v. dAvid 
4 1999  AL    Laura                 @laura v. dAvid 
5 1997  CA    John                  NA              
6 2002  TN    Kate                  NA

Input:

df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997", 
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe", 
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA, 
-6L))

df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "@laura v. dAvid")), class = "data.frame", row.names = c(NA, 
-2L))