Home > database >  How to join rows that match any of multiple columns
How to join rows that match any of multiple columns

Time:10-06

I'd like to join two dataframes that list different countries. However, the countries' names are not consistent. One df has a single column for each countries' name, and the other has multiple columns with different names for that country. e.g.

df1 <- 
  tibble(
    country = c("United Kingdom", "USA", "China", "India", "Russia"),
    value = c(1:5))

df2 <- 
  tibble(
    admin_full = c(
        "United Kingdom", 
        "United States of America", 
        "People's Republic of China",  
        "India", 
        "Russian Federation"),
    admin_short = c("UK", "United States", "China", "India", "Russia"),
    admin_abbreviated = c("UK", "USA", "PRC", "India", "Russia"),
    group = c("High", "High", "Medium", "Low", "Low"))

I would like to join the dfs where df1 matches any of the names in df2. So

  • UK should match on admin_full
  • USA should match on admin_abbreviated
  • China should match on admin_short
  • India should match on admin_full, admin_short and/or admin_abbreviated
  • Russia should match on admin_short and/or admin_abbreviated

I'm working with a list of ~200 countries so I don't want to have to go through an manually pick out which one of a dozen columns has the correct format to match that particular country on. I want to match where the name in df1 matches any one of multiple columns in df2.

CodePudding user response:

It sounds like any of the three admin_ columns could match, so my approach here is to make an extra column from the admin_full (a key for that country) and bring the three variations into different rows in a country column. Then we can join the two tables, and use distinct to keep only one match for each.

library(dplyr)
left_join(df1,
  df2 %>%
    mutate(full_name = admin_full) %>%
    pivot_longer(-c(full_name, group), values_to ="country")
  ) %>%
  distinct(country, value, group, full_name)

Result

# A tibble: 5 × 4
  country        value group  full_name                 
  <chr>          <int> <chr>  <chr>                     
1 United Kingdom     1 High   United Kingdom            
2 USA                2 High   United States of America  
3 China              3 Medium People's Republic of China
4 India              4 Low    India                     
5 Russia             5 Low    Russian Federation  

Or if you don't need a "canonical" name, you could shorten:

left_join(df1, pivot_longer(df2, -group, values_to ="country")) %>%
  distinct(country, value, group)

# A tibble: 5 × 3
  country        value group 
  <chr>          <int> <chr> 
1 United Kingdom     1 High  
2 USA                2 High  
3 China              3 Medium
4 India              4 Low   
5 Russia             5 Low  
  • Related