Home > Software engineering >  How to fuzzy lookup a string in one column to another column ignoring sub-setted words
How to fuzzy lookup a string in one column to another column ignoring sub-setted words

Time:01-20

I have the following 2 dataframes vendor_list and firm_list:

MARKET_ID <- c(1,2,3,4,5)
MARKET_NAME <- c("DELHI","MUMBAI","BANGALORE","KOLKATA","CHENNAI")
vendor_list <- data.frame(MARKET_ID,MARKET_NAME)
MARKET_NAME <- c("DELHI MUNICIPAL CORP","DELHI","MUMBAI","BENGALURU","BANGALORES","CITYKOLKATA")
POPULATION <- c(1000,2000,3000,4000,5000,6000)
firm_list <- data.frame(MARKET_NAME,POPULATION)

I need to search for strings in MARKET_NAME column in vendor_list dataframe in MARKET_NAME column in firm_list dataframe. But there are certain conditions:

It should only show as a match if the string is present as a stand alone block, i.e. it should not be a sub-set of the word.

So,

  1. The match of DELHI to DELHI MUNICIPAL CORP is TRUE
  2. The match of DELHI to DELHI is TRUE
  3. The match of BANGALORE to BANGALORES is FALSE as BANGALORE is a sub-set of BANGALORES
  4. The match of KOLKATA to CITYKOLKATA is FALSE as KOLKATA is a sub-set of CITYKOLKATA

Thus, the final dataframe final_market_info after lookup should look like this:

| MARKET_ID| MARKET_NAME.x |  MARKET_NAME.y      | POPULATION |
|    1     |     DELHI     | DELHI MUNICIPAL CORP| 1000       |
|    1     |     DELHI     |   DELHI             | 2000       |
|    2     |     MUMBAI    |   MUMBAI            | 3000       |

I had tried stringdist_join in stringr package using lcs and jw method but it was not giving me correct result like this.

CodePudding user response:

Is this what you need?

firm_list %>%
  mutate(match = str_extract(MARKET_NAME, str_c("\\b", vendor_list$MARKET_NAME, collapse = "|", "\\b"))) %>%
  left_join(., vendor_list %>% rename(match = MARKET_NAME), by = "match")
           MARKET_NAME POPULATION  match MARKET_ID
1 DELHI MUNICIPAL CORP       1000  DELHI         1
2                DELHI       2000  DELHI         1
3               MUMBAI       3000 MUMBAI         2
4            BENGALURU       4000   <NA>        NA
5           BANGALORES       5000   <NA>        NA
6          CITYKOLKATA       6000   <NA>        NA

The point here is that the elements in vendor_list$MARKET_NAME are wrapped into word \\boundary markers to get exact matches and concatenated as an alternation pattern.

To remove the rows without matches, use inner_join instead of left_join:

firm_list %>%
   mutate(match = str_extract(MARKET_NAME, str_c("\\b", vendor_list$MARKET_NAME, collapse = "|", "\\b"))) %>%
   inner_join(., vendor_list %>% rename(match = MARKET_NAME), by = "match")
           MARKET_NAME POPULATION  match MARKET_ID
1 DELHI MUNICIPAL CORP       1000  DELHI         1
2                DELHI       2000  DELHI         1
3               MUMBAI       3000 MUMBAI         2
  • Related