Home > front end >  join / merge two data frames with flexible condition
join / merge two data frames with flexible condition

Time:12-14

I guess my question might be unclear, so I try to clarify it. Suppose I have two data frames I want to merge. Each row in both data frames are unique companies. Both data frames also contain specific columns “id_1” and “id_2” with unique identifiers for these companies. I want them to join \ merge (data frames) using “id_1” and “id_2”. Minimum reproducible example id as follows:

x <- structure(list(company = c("Apple", "Amazon", "BMW", "Audi"), 
                    id_1 = c("789879978", NA_character_, "12312312", NA_character_),
                    id_2 = c("32132131", "987978987", NA_character_, NA_character_)),
               .Names = c("company", "id_1", "id_2"), row.names = c(1:4), class = "data.frame")


y <- structure(list(id_1 = c("789879978111", "987654000", "12312312", "111123231"),
                    id_2 = c("32132131", "987978987", "098099808908", "9999999991"),
                    region = c("USA", "USA", "EU", "EU")),
               .Names = c("id_1", "id_2", "region"), row.names = c(1:4), class = "data.frame")

Desired output:

structure(list(company = c("Apple", "Amazon", "BMW", "Audi"), 
               id_1 = c("789879978", NA_character_, "12312312", NA_character_),
               id_2 = c("32132131", "987978987", NA_character_, NA_character_),
               region = c("USA", "USA", "EU", NA_character_)),
        .Names = c("company", "id_1", "id_2", "region"), row.names = c(1:4), class = "data.frame")

As you might see for Apple “id_1” in “x” data frame is not the same than in “y” one, so I need to use “id_2” to merge these rows. Same for Amazon, while for BMW “id_2” in both data frames is NA so I can’t use “id_2”, so I need to use “id_1”. Suppose for some reason I can’t change/fix it. So that I need to use both “id_1” and “id_2” to merge data frames. The problem is that if I specify both columns as a keys in merge() and/or left_join() it would no work, since it requires a perfect match of values of both specified columns. In my case I need to use combination of “id_1” (if it is possible to match) or “id_2” (if I cant merge using “id_1”).

My questions is How can I merge two data frames using flexible condition, i.e. use “id_1” if it enough to merge and/or “id_2” if I can’t merge within “id_1”

CodePudding user response:

You could try this approach:


library(tidyr)
library(dplyr)

x %>% 
  pivot_longer(-company) %>% 
  left_join(pivot_longer(y, - region)) %>% 
  pivot_wider(names_from = name, values_from = value) %>% 
  group_by(company) %>% 
  fill(!company, .direction = "downup") %>% 
  distinct()
#> Joining, by = c("name", "value")
#> # A tibble: 4 x 4
#> # Groups:   company [4]
#>   company region id_1      id_2     
#>   <chr>   <chr>  <chr>     <chr>    
#> 1 Apple   USA    789879978 32132131 
#> 2 Amazon  USA    <NA>      987978987
#> 3 BMW     EU     12312312  <NA>     
#> 4 Audi    <NA>   <NA>      <NA>

Created on 2021-12-13 by the reprex package (v2.0.1)

CodePudding user response:

left_join(x, y[-2], by = "id_1") %>% 
  left_join(y[-1], by = "id_2") %>% 
  mutate(region = coalesce(region.x, region.y)) %>% 
  select(-c(region.x, region.y))

  company      id_1      id_2 region
1   Apple 789879978  32132131    USA
2  Amazon      <NA> 987978987    USA
3     BMW  12312312      <NA>     EU
4    Audi      <NA>      <NA>   <NA>
  • Related