I have the following 2 dataframes vendor_list
and firm_list
:
MARKET_ID <- c(1,2,3,4,5)
MARKET_NAME <- c("DELHI","MUMBAI","BANGALORE","KOLKATA","CHENNAI")
vendor_list <- data.frame(MARKET_ID,MARKET_NAME)
MARKET_NAME <- c("DELHI MUNICIPAL CORP","DELHI","MUMBAI","BENGALURU","BANGALORES","CITYKOLKATA")
POPULATION <- c(1000,2000,3000,4000,5000,6000)
firm_list <- data.frame(MARKET_NAME,POPULATION)
I need to search for strings in MARKET_NAME column in vendor_list
dataframe in MARKET_NAME column in firm_list
dataframe. But there are certain conditions:
It should only show as a match if the string is present as a stand alone block, i.e. it should not be a sub-set of the word.
So,
- The match of DELHI to DELHI MUNICIPAL CORP is
TRUE
- The match of DELHI to DELHI is
TRUE
- The match of BANGALORE to BANGALORES is
FALSE
as BANGALORE is a sub-set of BANGALORES - The match of KOLKATA to CITYKOLKATA is
FALSE
as KOLKATA is a sub-set of CITYKOLKATA
Thus, the final dataframe final_market_info
after lookup should look like this:
| MARKET_ID| MARKET_NAME.x | MARKET_NAME.y | POPULATION |
| 1 | DELHI | DELHI MUNICIPAL CORP| 1000 |
| 1 | DELHI | DELHI | 2000 |
| 2 | MUMBAI | MUMBAI | 3000 |
I had tried stringdist_join in stringr package using lcs and jw method but it was not giving me correct result like this.
CodePudding user response:
Is this what you need?
firm_list %>%
mutate(match = str_extract(MARKET_NAME, str_c("\\b", vendor_list$MARKET_NAME, collapse = "|", "\\b"))) %>%
left_join(., vendor_list %>% rename(match = MARKET_NAME), by = "match")
MARKET_NAME POPULATION match MARKET_ID
1 DELHI MUNICIPAL CORP 1000 DELHI 1
2 DELHI 2000 DELHI 1
3 MUMBAI 3000 MUMBAI 2
4 BENGALURU 4000 <NA> NA
5 BANGALORES 5000 <NA> NA
6 CITYKOLKATA 6000 <NA> NA
The point here is that the elements in vendor_list$MARKET_NAME
are wrapped into word \\b
oundary markers to get exact matches and concatenated as an alternation pattern.
To remove the rows without matches, use inner_join
instead of left_join
:
firm_list %>%
mutate(match = str_extract(MARKET_NAME, str_c("\\b", vendor_list$MARKET_NAME, collapse = "|", "\\b"))) %>%
inner_join(., vendor_list %>% rename(match = MARKET_NAME), by = "match")
MARKET_NAME POPULATION match MARKET_ID
1 DELHI MUNICIPAL CORP 1000 DELHI 1
2 DELHI 2000 DELHI 1
3 MUMBAI 3000 MUMBAI 2