I would like to detect the whole names of cities1 from df1 that are included totally or partially in the names of cities2 from df2 (i.e. considering cities as characters strings), in such a way that the common characters are displayed in a new column “match”, and using dplyr and stringr, ideally.
city1 <- c("boston", "cambridge", "houston")
df1 <- as.data.frame(city1)
df1
city2 <- c("atlanta", "denver", "cambridge", "cambridgeuk", "london", "york")
df2 <- as.data.frame(city2)
df2
What I would like (note the difference between cambridge and cambridgeuk):
city1 city2 match
<chr> <chr> <chr>
1 boston atlanta NA
2 cambridge denver NA
3 houston cambridge cambridge
4 NA cambridgeuk cambridge
5 NA london NA
6 NA york NA
Thanks for help
CodePudding user response:
If the matches are unique we can try:
library(tidyverse)
cities1 <- df1$city1 %>%
str_c(collapse = '|')
df2 %>%
mutate(match = str_extract(city2, cities1))
#> city2 match
#> 1 atlanta <NA>
#> 2 denver <NA>
#> 3 cambridge cambridge
#> 4 cambridgeuk cambridge
#> 5 london <NA>
#> 6 york <NA>
But suppose we have cambridgehouston
inside of df2$city2
, then it's a bit more complicated.
I'm sure that there are more clear ways to do it, but this can work.
library(tidyverse)
city1 <- c("boston", "cambridge", "houston")
df1 <- as.data.frame(city1)
df1
#> city1
#> 1 boston
#> 2 cambridge
#> 3 houston
city2 <- c("atlanta", "denver", "cambridgehouston", "cambridgeuk", "london", "york")
df2 <- as.data.frame(city2)
df2
#> city2
#> 1 atlanta
#> 2 denver
#> 3 cambridgehouston
#> 4 cambridgeuk
#> 5 london
#> 6 york
result_match <-
map_dfc(df1$city1, ~
tibble(!!.x := str_extract(df2$city2, .) %>% replace_na(""))) %>%
rowwise() %>%
transmute(match = c_across(all_of(df1$city1)) %>% str_c(collapse = " ") %>%
str_trim() %>%
na_if(""))
bind_cols(df2, result_match)
#> city2 match
#> 1 atlanta <NA>
#> 2 denver <NA>
#> 3 cambridgehouston cambridge houston
#> 4 cambridgeuk cambridge
#> 5 london <NA>
#> 6 york <NA>
Created on 2021-12-24 by the reprex package (v2.0.1)