Home > other >  How to detect names from a dataframe that are included totally or partially in names from another da
How to detect names from a dataframe that are included totally or partially in names from another da

Time:12-27

I would like to detect the whole names of cities1 from df1 that are included totally or partially in the names of cities2 from df2 (i.e. considering cities as characters strings), in such a way that the common characters are displayed in a new column “match”, and using dplyr and stringr, ideally.

city1 <- c("boston", "cambridge", "houston")
df1 <- as.data.frame(city1)
df1

city2 <- c("atlanta", "denver", "cambridge", "cambridgeuk", "london", "york")
df2 <- as.data.frame(city2)
df2

What I would like (note the difference between cambridge and cambridgeuk):

  city1     city2       match    
  <chr>     <chr>       <chr>    
1 boston    atlanta     NA       
2 cambridge denver      NA       
3 houston   cambridge   cambridge
4 NA        cambridgeuk cambridge
5 NA        london      NA       
6 NA        york        NA

Thanks for help

CodePudding user response:

If the matches are unique we can try:

library(tidyverse)

cities1 <- df1$city1 %>%
 str_c(collapse = '|')
df2 %>%
 mutate(match = str_extract(city2, cities1))
#>         city2     match
#> 1     atlanta      <NA>
#> 2      denver      <NA>
#> 3   cambridge cambridge
#> 4 cambridgeuk cambridge
#> 5      london      <NA>
#> 6        york      <NA>

But suppose we have cambridgehouston inside of df2$city2, then it's a bit more complicated. I'm sure that there are more clear ways to do it, but this can work.

library(tidyverse)

city1 <- c("boston", "cambridge", "houston")
df1 <- as.data.frame(city1)
df1
#>       city1
#> 1    boston
#> 2 cambridge
#> 3   houston

city2 <- c("atlanta", "denver", "cambridgehouston", "cambridgeuk", "london", "york")
df2 <- as.data.frame(city2)
df2
#>              city2
#> 1          atlanta
#> 2           denver
#> 3 cambridgehouston
#> 4      cambridgeuk
#> 5           london
#> 6             york


result_match <-
  map_dfc(df1$city1, ~
  tibble(!!.x := str_extract(df2$city2, .) %>% replace_na(""))) %>%
  rowwise() %>%
  transmute(match = c_across(all_of(df1$city1)) %>% str_c(collapse = " ") %>%
    str_trim() %>%
    na_if(""))

bind_cols(df2, result_match)
#>              city2             match
#> 1          atlanta              <NA>
#> 2           denver              <NA>
#> 3 cambridgehouston cambridge houston
#> 4      cambridgeuk         cambridge
#> 5           london              <NA>
#> 6             york              <NA>

Created on 2021-12-24 by the reprex package (v2.0.1)

  • Related