Running a string against multiple match dataframes-CodePudding

I have a dataset of text strings that look something like this:

strings <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger", 
"Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert", "Jessica Wright Htx Satx", 
"Julie Green Lifestyle Blogger", "Mike S Thomas Football Player", 
"Tiny Fitness Houston Studio")), class = "data.frame", row.names = c(NA, 
-8L))

I am trying to evaluate matches in those strings against two different datasets called firstname and lastname that look as such:

firstname <- structure(list(firstnames = c("Jennifer", "Lisa", "Tina", "Jamie", 
"Jessica", "Julie", "Mike", "George")), class = "data.frame", row.names = c(NA, 
-8L))

lastname <- structure(list(lastnames = c("Hancock", "Smith", "Houston", "Fay", 
"Tucker", "Wright", "Green", "Thomas")), class = "data.frame", row.names = c(NA, 
-8L))

First thing I would like to do is remove everything after the first three words in each string, so "Jennifer Rae Hancock Brown" would just become "Jessica Rae Hancock" and "Lisa Smith Houston Blogger" would become "Lisa Smith Houston"

After that, I then want to evaluate the first word of each string to see if it matches to anything in the firstname dataframe. If it does match, it creates a new column called in the final table called firstname with the result. If it doesn't match, the result is simply "N/A".

After that, I'd like to then evaluate the remaining words against the lastname dataframe. There can be multiple matches (As seen in the "Lisa Smith Houston" example) and if that's the case, both results will be stored in the final dataframe.

The final dataframe should look like this:

final <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger", 
"Lisa Smith Houston Blogger", "Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert", 
"Jessica Wright Htx Satx", "Julie Green Lifestyle Blogger", "Mike S Thomas Football Player", 
"Tiny George Fitness Houston Studio"), firstname = c("Jennifer", 
"Lisa", "Lisa", "Tina", "Jamie", "Jessica", "Julie", "Mike", 
"N/A"), lastname = c("Hancock", "Smith", "Houston", "Fay", "Tucker", 
"Wright", "Green", "Thomas", "N/A")), class = "data.frame", row.names = c(NA, 
-9L))

What would be the most effective way to go about doing this?

CodePudding user response：

We may use str_extract_all on the substring of 'string2' with pattern as the firstnames, lastnames vector converted to a single string with | (OR as delimiter) and return a list of vectors, then use unnest to convert the list to vector

library(dplyr)
library(stringr)
library(tidyr)
strings %>%
   mutate(string2 = str_extract(trimws(string), "^\\S \\s \\S \\s \\S "),
   firstname = str_extract_all(string2, 
    str_c(firstname$firstnames, collapse = "|")), 
   lastname =str_extract_all(string2, 
     str_c(lastname$lastnames, collapse = "|")) ) %>% 
   unnest(where(is.list), keep_empty = TRUE) %>% 
   select(-string2)%>% 
   mutate(lastname = case_when(complete.cases(firstname) ~ lastname))

-output

# A tibble: 9 × 3
  string                          firstname lastname
  <chr>                           <chr>     <chr>   
1 "Jennifer Rae Hancock Brown"    Jennifer  Hancock 
2 "Lisa Smith Houston Blogger"    Lisa      Smith   
3 "Lisa Smith Houston Blogger"    Lisa      Houston 
4 "Tina Fay Las Cruces"           Tina      Fay     
5 "\t\nJamie Tucker Style Expert" Jamie     Tucker  
6 "Jessica Wright Htx Satx"       Jessica   Wright  
7 "Julie Green Lifestyle Blogger" Julie     Green   
8 "Mike S Thomas Football Player" Mike      Thomas  
9 "Tiny Fitness Houston Studio"   <NA>      <NA>

OP's expected

> final
                              string firstname lastname
1         Jennifer Rae Hancock Brown  Jennifer  Hancock
2         Lisa Smith Houston Blogger      Lisa    Smith
3         Lisa Smith Houston Blogger      Lisa  Houston
4                Tina Fay Las Cruces      Tina      Fay
5      \t\nJamie Tucker Style Expert     Jamie   Tucker
6            Jessica Wright Htx Satx   Jessica   Wright
7      Julie Green Lifestyle Blogger     Julie    Green
8      Mike S Thomas Football Player      Mike   Thomas
9 Tiny George Fitness Houston Studio       N/A      N/A