Home > database >  Matching a text string to a source file
Matching a text string to a source file

Time:12-14

I am trying to evaluate the following text strings:

text_example <- structure(list(text_string = c( 
"REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS", "BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON"
)), class = "data.frame", row.names = c(NA, -2L))

I am trying to use a source file to extract the city and state names out of the text string. Here is the source file:

city_example <- structure(list(city_ex = c("DALLAS, TX", "EL PASO, TX"), city = c("DALLAS", 
"EL PASO"), State = c("TX", "TX")), class = "data.frame", row.names = c(NA, 
-2L))

I would like the final output to look like this:

output_example <- structure(list(text_string = c("BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON", 
"REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS", ""), city = c("EL PASO", 
"DALLAS", ""), state = c("TX", "TX", "")), class = "data.frame", row.names = c(NA, 
-3L))

But when I run the following it returns zero results, which should not be the case:

output_example <- text_example %>%
  separate_rows(text_string) %>%
  left_join(city_example, by = c("text_string" = "city_ex")) %>%
  filter(!is.na(state)) %>% dplyr::select(text_string, city, state) %>% distinct()

What is it about that code that doesn't appear to be working and how can it best be fixed?

CodePudding user response:

You can use fuzzyjoin:

fuzzyjoin::regex_left_join(text_example, city_example, by = c("text_string" = "city_ex"))
#                                           text_string     city_ex    city State
# 1                   COLLEGE STUDENT LIVING IN HOUSTON        <NA>    <NA>  <NA>
# 2        REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS  DALLAS, TX  DALLAS    TX
# 3 BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON EL PASO, TX EL PASO    TX

Your output_example suggested that the non-matching row should be the empty string, so you can reproduce that with:

library(dplyr)
fuzzyjoin::regex_left_join(text_example, city_example, by = c("text_string" = "city_ex")) %>%
  mutate(across(everything(), ~ if_else(is.na(city_ex), "", .))) %>%
  select(-city_ex)
#                                           text_string    city State
# 1                                                                  
# 2        REALTOR IN DALLAS, TX! CALL NOW FOR SHOWINGS  DALLAS    TX
# 3 BORN IN EL PASO, TX AND CURRENTLY LIVING IN HOUSTON EL PASO    TX

... though I personally prefer having NA values (an empty string does have value at times) or remove the row (using regex_inner_join).

  •  Tags:  
  • r
  • Related