I am trying to evaluate the following text strings:
text_example <- structure(list(text_string = c(
"REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS", "BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON"
)), class = "data.frame", row.names = c(NA, -2L))
I am trying to use a source file to extract the city and state names out of the text string. Here is the source file:
city_example <- structure(list(city_ex = c("DALLAS, TX", "EL PASO, TX"), city = c("DALLAS",
"EL PASO"), State = c("TX", "TX")), class = "data.frame", row.names = c(NA,
-2L))
I would like the final output to look like this:
output_example <- structure(list(text_string = c("BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON",
"REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS", ""), city = c("EL PASO",
"DALLAS", ""), state = c("TX", "TX", "")), class = "data.frame", row.names = c(NA,
-3L))
But when I run the following it returns zero results, which should not be the case:
output_example <- text_example %>%
separate_rows(text_string) %>%
left_join(city_example, by = c("text_string" = "city_ex")) %>%
filter(!is.na(state)) %>% dplyr::select(text_string, city, state) %>% distinct()
What is it about that code that doesn't appear to be working and how can it best be fixed?
CodePudding user response:
You can use fuzzyjoin
:
fuzzyjoin::regex_left_join(text_example, city_example, by = c("text_string" = "city_ex"))
# text_string city_ex city State
# 1 COLLEGE STUDENT LIVING IN HOUSTON <NA> <NA> <NA>
# 2 REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS DALLAS, TX DALLAS TX
# 3 BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON EL PASO, TX EL PASO TX
Your output_example
suggested that the non-matching row should be the empty string, so you can reproduce that with:
library(dplyr)
fuzzyjoin::regex_left_join(text_example, city_example, by = c("text_string" = "city_ex")) %>%
mutate(across(everything(), ~ if_else(is.na(city_ex), "", .))) %>%
select(-city_ex)
# text_string city State
# 1
# 2 REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS DALLAS TX
# 3 BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON EL PASO TX
... though I personally prefer having NA
values (an empty string does have value at times) or remove the row (using regex_inner_join
).