This is an extension of my previous post.
I have the following dataframes (df1 and df2) that I'm trying to merge:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("Molly Homes, Jane Doe", "Sally", "David", "Laura", "John", "Kate")
df1 <- data.frame(year, state, name)
year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("Homes (v. Vista)", "@laura v. dAvid")
df2 <- data.frame(year, state, versus)
And I df4 is my ideal output:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("Molly Homes, Jane Doe", "Sally", "David", "Laura", "John", "Kate")
versus <- c("Homes (v. Vista)", "@ george v. SALLY", "@laura v. dAvid", "@laura v. dAvid", NA, NA)
df4 <- data.frame(year, state, name, versus)
The kind responders on the last post suggested this (and a variation):
library(dplyr)
df3 <- left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus:=if_else(grepl(name,versus,ignore.case=T), versus,as.character(NA)))
The problem with the above code is that it doesn't match subsets. Ideally, I'd like grepl(x, y) to match each other, vice versa. If x is in y and/or y is in x, then it's TRUE and results in the value in the "versus" column.
CodePudding user response:
fuzzyjoin
is meant for regex searches like this :-)
library(dplyr)
# library(tidyr) # unnest
# library(fuzzyjoin) # fuzzy_*_join
df1 %>%
mutate(
rn = row_number(),
ptn = strsplit(name, "[ ,] ")
) %>%
tidyr::unnest(ptn) %>%
fuzzyjoin::fuzzy_left_join(df2,
by = c("year" = "year", "state" = "state", "ptn" = "versus"),
match_fun = list(`==`, `==`, function(...) Vectorize(grepl)(..., ignore.case = TRUE))
) %>%
group_by(rn, year = year.x, state = state.x, name) %>%
summarize(versus = na.omit(versus)[1], .groups = "drop") %>%
select(-rn)
# # A tibble: 6 x 4
# year state name versus
# <chr> <chr> <chr> <chr>
# 1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
# 2 2002 TN Sally NA
# 3 1999 AL David @laura v. dAvid
# 4 1999 AL Laura @laura v. dAvid
# 5 1997 CA John NA
# 6 2002 TN Kate NA
CodePudding user response:
We need a way to retrieve the series of whole words, and check if any of them appear (case-insensitive) within the versus
column. Here is one simple way to do this:
- Create function (
f(n,v)
), which takes stringsn
andv
, extracts the whole words (wrds
) fromn
, and then counts how many of them are found inv
. Returns TRUE if this count exceeds 0
f <- function(n,v) {
wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
sum(sapply(wrds[which(nchar(wrds)>1)], grepl,x=v,ignore.case=T))>0
}
- Left join the original frames, and apply
f()
by row, retaining versus if one or more whole words fromname
are found inveruss
, else set to NA
left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus:=if_else(f(name, versus), versus,NA_character_))
Output:
1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
2 2002 TN Sally NA
3 1999 AL David @laura v. dAvid
4 1999 AL Laura @laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Input:
df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997",
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe",
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA,
-6L))
df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "@laura v. dAvid")), class = "data.frame", row.names = c(NA,
-2L))