Home > Enterprise >  Joining two dataframes on a condition (grepl)
Joining two dataframes on a condition (grepl)

Time:10-15

I'm looking to join two dataframes based on a condition, in this case, that one string is inside another. Say I have two dataframes,

df1 <- data.frame(fullnames=c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"), 
                  ages = c(30, 51, 45, 38, 20))

       fullnames ages
1       Jane Doe   30
2 Mr. John Smith   51
3 Nate Cox, Esq.   45
4   Bill Lee III   38
5 Ms. Kate Smith   20

df2 <- data.frame(lastnames=c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"), 
                  ages=c(30, 45, 20, 28, 51, 38), 
                  homestate=c("NJ", "CT", "MA", "RI", "MA", "NY"))
  lastnames ages homestate
1       Doe   30        NJ
2       Cox   45        CT
3     Smith   20        MA
4      Jung   28        RI
5     Smith   51        MA
6       Lee   38        NY

I want to do a left join on these two dataframes on ages and the row in which df2$lastnames is contained within df1$fullnames. I thought fuzzy_join might do it, but I don't think it liked my grepl:

joined_dfs <- fuzzy_join(df1, df2, by = c("ages", "fullnames"="lastnames"), 
                           match_fun = c("=", "grepl()"),
                           mode="left")
Error in which(m) : argument to 'which' is not logical

Desired result: a dataframe identical to the first but with a "homestate" column appended. Any ideas?

CodePudding user response:

You had the right idea, but you went wrong in your interpretation of the match_fun parameter in fuzzyjoin::fuzzy_join(). Per the documentation, match_fun should be a

Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in by (if a named list, it uses the names in x). If only one function is given it is used on all column pairs.

Solution

Simply this should do the trick. For conceptual clarity, I've typographically aligned the by columns with the functions used to match them:

joined_dfs <- fuzzy_join(
  df1, df2,
  by        =    c("ages", "fullnames" = "lastnames"),
#                  |----|  |-----------------------|
  match_fun = list(`==`  , stringr::str_detect      ),
  mode = "left"
)

Given your sample data reproduced here

df1 <- data.frame(
  fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
  ages = c(30, 51, 45, 38, 20)
)

df2 <- data.frame(
  lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
  ages = c(30, 45, 20, 28, 51, 38),
  homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)

this solution should produce the following result for joined_dfs:

       fullnames ages.x lastnames ages.y homestate
1       Jane Doe     30       Doe     30        NJ
2 Mr. John Smith     51     Smith     51        MA
3 Nate Cox, Esq.     45       Cox     45        CT
4   Bill Lee III     38       Lee     38        NY
5 Ms. Kate Smith     20     Smith     20        MA

Note

Because each age is coincidentally a unique key, the following join on only *names

fuzzy_join(
  df1, df2,
  by = c("fullnames" = "lastnames"),
  match_fun = stringr::str_detect,
  mode = "left"
)

will better illustrate the behavior of matching on substrings:

       fullnames ages.x lastnames ages.y homestate
1       Jane Doe     30       Doe     30        NJ
2 Mr. John Smith     51     Smith     20        MA
3 Mr. John Smith     51     Smith     51        MA
4 Nate Cox, Esq.     45       Cox     45        CT
5   Bill Lee III     38       Lee     38        NY
6 Ms. Kate Smith     20     Smith     20        MA
7 Ms. Kate Smith     20     Smith     51        MA

Where You Went Wrong

Error in Type

The value passed to match_fun should be either (the symbol for) a function

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = grepl
  # ...
)

or a list of such (symbols for) functions:

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = list(`=`, grepl)
  # ...
)

Rather than a list of symbols

match_fun = list(=, grepl)

you provided a vector of character strings:

match_fun = c("=", "grepl()")

Error in Syntax

The user should name the functions

`=`
grepl

yet you attempted to call them:

=
grepl()

Naming them will pass the functions themselves to match_fun, as intended, whereas calling them will pass their return values*. In R, an operator like = is named using backticks: `=`.

* Assuming the calls didn't fail with errors. Here, they would fail.

Inappropriate Functions

To compare two values for equality, here the character vectors df1$fullnames and df2$lastnames, you should use the relational operator ==; yet you attempted to use the assignment operator =.

Furthermore grepl() is not vectorized in quite the way match_fun desires. While its second argument (x) is indeed a vector

a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.

its first argument (pattern) is (treated as) a single character string:

character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr, gregexpr and regexec.

Thus, grepl() is not a

Vectorized function given two columns...

but rather a function given one string (scalar) and one column (vector) of strings.

The answer to your prayers is not grepl() but rather something like stringr::str_detect(), which is

Vectorised over string and pattern. Equivalent to grepl(pattern, x).

and which wraps stringi::str_detect().

Note

Since you're simply trying to detect whether a literal string in df1$fullnames contains a literal string in df2$lastnames, you don't want to accidentally treat the strings in df2$lastnames as regular expression patterns. Now your df2$lastnames column is statistically unlikely to contain names with special regex characters; with the lone exception of -, which is interpreted literally unless within [], which are very unlikely to be found in a last name.

If you're still worried about accidental regex, you might want to consider alternative search methods with stringi::str_detect_fixed() or stringi::str_detect_coll(). These perform literal matching, respectively by either byte or "canonical equivalence"; the latter adjusts for locale and special characters, in keeping with natural language processing.

CodePudding user response:

This seems to work given your two dataframes:

library(dplyr)
df1 %>%
  mutate(
    # create new column that gets rid of strings after last name:
    lastnames = sub("\\sI{1,3}$|,. $", "", fullnames),
    # grab last names:
    lastnames = sub(".*?(\\w )$", "\\1", lastnames)) %>%
  # join the two dataframes by the now-common column `lastnames`:
  left_join(., df2, by = "lastnames") %>%
  # deselect obsolete column:
  select(-ages.y) %>%
  # rename `ages` column:
  rename(ages = ages.x)
       fullnames ages lastnames homestate
1       Jane Doe   30       Doe        NJ
2 Mr. John Smith   51     Smith        MA
3 Mr. John Smith   51     Smith        MA
4 Nate Cox, Esq.   45       Cox        CT
5   Bill Lee III   38       Lee        NY
6 Ms. Kate Smith   20     Smith        MA
7 Ms. Kate Smith   20     Smith        MA

If you want lastnamesremoved too just use:

select(-c(ages.y, lastnames))    
  • Related