I have a data.frame of names such as the following which has a sample of some surnames followed by an initial (e.g. Smith S or Lopez-Garcia M):

df<-data.frame(names=c("Adu-Amankwah E",
"Smith Dawson E",
"Lopez-Garcia M",
"Lopez Garcia MA",
"Garcia MAC",
"Lopez Garcia MA",
"Garcia MAC"))

I would like to pull out all those names with a double-barreled surname and do a little cleaning:

pull out any with a hyphen (-) or two surnames (e.g. Lopez Garcia).
I need to replace any of the following: Lopez Garcia MA,Lopez-Garcia MA or Garcia MAC with Lopez-Garcia M. And Smith Dawson E should be Smith-Dawson E.

Output would look like:

df<-data.frame(names=c("Adu-Amankwah E",
"Smith-Dawson E",
"Lopez-Garcia M",
"Lopez-Garcia M",
"Lopez-Garcia M",
"Lopez-Garcia M",
"Lopez-Garcia M"))

CodePudding user response：

As I mentioned in my comments, the challenge here is not so much parsing the character strings as it is defining the logic to

associate variants of the same name (ex. "Garcia MAC", "Lopez Garcia MA") under a representative label ("Lopez-Garcia M"); and still
avoid lumping together similar variants (like "Garcia A") of different names (like "Andy Garcia").

As such, your best approach might be to define a mapping table for known variations on the names.

Literal Mapping

A literal mapping involves typing out every known variant, next to the name it truly represents.

mapping_lit <- data.frame(
  True_Name = c("Adu-Amankwah E", "Smith-Dawson E", "Lopez-Garcia M", "Lopez-Garcia M",  "Lopez-Garcia M"),
  Variant   = c("Adu-Amankwah E", "Smith Dawson E", "Lopez-Garcia M", "Lopez Garcia MA", "Garcia MAC")
)

mapping_lit
#>        True_Name         Variant
#> 1 Adu-Amankwah E  Adu-Amankwah E
#> 2 Smith-Dawson E  Smith Dawson E
#> 3 Lopez-Garcia M  Lopez-Garcia M
#> 4 Lopez-Garcia M Lopez Garcia MA
#> 5 Lopez-Garcia M      Garcia MAC

Once you have your mapping, a simple dplyr::*_join() should do the trick

library(dplyr)

# The LEFT JOIN preserves any names without matches, so you can handle them as you wish.
left_join(
  df,
  mapping_lit,
  by = c("names" = "Variant")
)

with the following result:

            names      True_Name
1  Adu-Amankwah E Adu-Amankwah E
2  Smith Dawson E Smith-Dawson E
3  Lopez-Garcia M Lopez-Garcia M
4 Lopez Garcia MA Lopez-Garcia M
5      Garcia MAC Lopez-Garcia M
6 Lopez Garcia MA Lopez-Garcia M
7      Garcia MAC Lopez-Garcia M

Regex Mapping

If you're skilled enough with regular expressions, you could just define one regex to represent all variants on each True_Name:

mapping_rgx <- data.frame(
  True_Name = c("Adu-Amankwah E",             "Smith-Dawson E",             "Lopez-Garcia M"),
  Pattern   = c("^(Adu[- ]?)?Amankwah( E)?$", "^(Smith[- ]?)?Dawson( E)?$", "^(Lopez[- ]?)?Garcia( M(AC?)?)?$")
)

mapping_rgx
#>        True_Name                          Pattern
#> 1 Adu-Amankwah E       ^(Adu[- ]?)?Amankwah( E)?$
#> 2 Smith-Dawson E       ^(Smith[- ]?)?Dawson( E)?$
#> 3 Lopez-Garcia M ^(Lopez[- ]?)?Garcia( M(AC?)?)?$

Once you have this mapping, you'll need a fuzzyjoin::regex_*_join() to match up the variants

library(fuzzyjoin)

# The LEFT JOIN preserves any names without matches, so you can handle them as you wish.
regex_left_join(
  df,
  mapping_rgx,
  by = c("names" = "Pattern"),
  # Account for typos in capitalization.
  ignore_case = TRUE
)

with the following result:

            names      True_Name                          Pattern
1  Adu-Amankwah E Adu-Amankwah E         (Adu[- ]?)?Amankwah( E)?
2  Smith Dawson E Smith-Dawson E         (Smith[- ]?)?Dawson( E)?
3  Lopez-Garcia M Lopez-Garcia M ^(Lopez[- ]?)?Garcia( M(AC?)?)?$
4 Lopez Garcia MA Lopez-Garcia M ^(Lopez[- ]?)?Garcia( M(AC?)?)?$
5      Garcia MAC Lopez-Garcia M ^(Lopez[- ]?)?Garcia( M(AC?)?)?$
6 Lopez Garcia MA Lopez-Garcia M ^(Lopez[- ]?)?Garcia( M(AC?)?)?$
7      Garcia MAC Lopez-Garcia M ^(Lopez[- ]?)?Garcia( M(AC?)?)?$

Warning

As I also commented, I might not recommend a stringdist approach in this situation. Each name varies not only in spelling but also in structure. It's entirely possible that two similarly structured entries for two different people

Variant	True_Name
Garcia A	Andy Garcia
Garcia MAC	Lopez-Garcia M
Lopez-Garcia M	Lopez-Garcia M

have a shorter string distance than do two differently structured variations on the same name:

# Run the full gamut of methods for 'stringdist::stringdist()'.
methods <- c(
  "osa", "lv", "dl", "hamming", "lcs", "qgram",
  "cosine", "jaccard", "jw", "soundex"
)


# Display string distances for variants of the same and of different names:
rbind(
  # Compare different names.
  sapply(X = methods, FUN = function(x) {stringdist::stringdist(
    a = "Garcia MAC", b = "Garcia A",
    method = x
  )}),
  # Compare variations on the same name.
  sapply(X = methods, FUN = function(x) {stringdist::stringdist(
    a = "Garcia MAC", b = "Lopez-Garcia M",
    method = x
  )})
)

#>      osa lv dl hamming lcs qgram     cosine   jaccard         jw soundex
#> [1,]   2  2  2     Inf   2     2 0.08712907 0.2222222 0.06666667       1
#> [2,]   8  8  8     Inf   8     8 0.27831216 0.5333333 0.20952381       1