I have a dataset of text strings that look something like this:
strings <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
"Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert", "Jessica Wright Htx Satx",
"Julie Green Lifestyle Blogger", "Mike S Thomas Football Player",
"Tiny Fitness Houston Studio")), class = "data.frame", row.names = c(NA,
-8L))
I am trying to evaluate matches in those strings against two different datasets called firstname
and lastname
that look as such:
firstname <- structure(list(firstnames = c("Jennifer", "Lisa", "Tina", "Jamie",
"Jessica", "Julie", "Mike", "George")), class = "data.frame", row.names = c(NA,
-8L))
lastname <- structure(list(lastnames = c("Hancock", "Smith", "Houston", "Fay",
"Tucker", "Wright", "Green", "Thomas")), class = "data.frame", row.names = c(NA,
-8L))
First thing I would like to do is remove everything after the first three words in each string, so "Jennifer Rae Hancock Brown"
would just become "Jessica Rae Hancock"
and "Lisa Smith Houston Blogger"
would become "Lisa Smith Houston"
After that, I then want to evaluate the first word of each string to see if it matches to anything in the firstname
dataframe. If it does match, it creates a new column called in the final table called firstname
with the result. If it doesn't match, the result is simply "N/A".
After that, I'd like to then evaluate the remaining words against the lastname
dataframe. There can be multiple matches (As seen in the "Lisa Smith Houston" example) and if that's the case, both results will be stored in the final dataframe.
The final dataframe should look like this:
final <- structure(list(string = c("Jennifer Rae Hancock Brown", "Lisa Smith Houston Blogger",
"Lisa Smith Houston Blogger", "Tina Fay Las Cruces", "\t\nJamie Tucker Style Expert",
"Jessica Wright Htx Satx", "Julie Green Lifestyle Blogger", "Mike S Thomas Football Player",
"Tiny George Fitness Houston Studio"), firstname = c("Jennifer",
"Lisa", "Lisa", "Tina", "Jamie", "Jessica", "Julie", "Mike",
"N/A"), lastname = c("Hancock", "Smith", "Houston", "Fay", "Tucker",
"Wright", "Green", "Thomas", "N/A")), class = "data.frame", row.names = c(NA,
-9L))
What would be the most effective way to go about doing this?
CodePudding user response:
We may use str_extract_all
on the substring of 'string2' with pattern
as the firstnames, lastnames vector converted to a single string with |
(OR as delimiter) and return a list
of vectors, then use unnest
to convert the list
to vector
library(dplyr)
library(stringr)
library(tidyr)
strings %>%
mutate(string2 = str_extract(trimws(string), "^\\S \\s \\S \\s \\S "),
firstname = str_extract_all(string2,
str_c(firstname$firstnames, collapse = "|")),
lastname =str_extract_all(string2,
str_c(lastname$lastnames, collapse = "|")) ) %>%
unnest(where(is.list), keep_empty = TRUE) %>%
select(-string2)%>%
mutate(lastname = case_when(complete.cases(firstname) ~ lastname))
-output
# A tibble: 9 × 3
string firstname lastname
<chr> <chr> <chr>
1 "Jennifer Rae Hancock Brown" Jennifer Hancock
2 "Lisa Smith Houston Blogger" Lisa Smith
3 "Lisa Smith Houston Blogger" Lisa Houston
4 "Tina Fay Las Cruces" Tina Fay
5 "\t\nJamie Tucker Style Expert" Jamie Tucker
6 "Jessica Wright Htx Satx" Jessica Wright
7 "Julie Green Lifestyle Blogger" Julie Green
8 "Mike S Thomas Football Player" Mike Thomas
9 "Tiny Fitness Houston Studio" <NA> <NA>
OP's expected
> final
string firstname lastname
1 Jennifer Rae Hancock Brown Jennifer Hancock
2 Lisa Smith Houston Blogger Lisa Smith
3 Lisa Smith Houston Blogger Lisa Houston
4 Tina Fay Las Cruces Tina Fay
5 \t\nJamie Tucker Style Expert Jamie Tucker
6 Jessica Wright Htx Satx Jessica Wright
7 Julie Green Lifestyle Blogger Julie Green
8 Mike S Thomas Football Player Mike Thomas
9 Tiny George Fitness Houston Studio N/A N/A