Home > Software engineering >  R full_join with multiple column criteria
R full_join with multiple column criteria

Time:11-26

I am trying to full join several dataframes (d1, d2,...) based on several conditions: Either the ID has to match exactly beetween the two dataframes (id1 and id2) OR the location value of d2 has to be between the min and max columns of d1.

data1 <- data.frame(id1 = c("A","B","C","D"),                  
                    location1 = c(123,247,335,454),
                    min1 = c(100,200,300,400),
                    max1 = c(199,299,399,499))


data2 <- data.frame(id2 = c("A","E","F"),                  
                    location2 = c(123,221,522),
                    min2 = c(100,212,500),
                    max2 = c(199,221,599))


What I want is:

   id1  location1   min1    max1    id2 location2   min2    max2
1   A         123   100      199    A   123          100    199
2   B         247   200      299    E   221          212    221
3   C         335   300      399    NA  NA            NA    NA
4   D         454   400      499    NA  NA            NA    NA
5   NA        NA    NA        NA    F   522           500   599
  • row1: ids are a perfect match, look no further
  • row2: ids are not a match but location2 is between min1 and max1,so this appears on same row
  • row3/4: no match, retained from d1
  • row5:nomatch, retained from d2

CodePudding user response:

The sqldf package provides a possible solution.

The union is required because sqldf() does not support full outer joins. I also use dplyr's arrange() at the end because it orders the rows as per your desired output, whereas order by in sqldf() puts the missing values first.

library(sqldf)
library(dplyr)

sqldf(
    "select * 
    from data1 d1 left join data2 d2
            on d1.id1 = d2.id2 or (d1.id1 <> d2.id2 and d1.min1 <= d2.location2 and d1.max1 >= d2.location2)
    
    union
    
    select d1.*, d2.*
    from data2 d2 left join data1 d1 
            on d1.id1 = d2.id2 or (d1.id1 <> d2.id2 and d1.min1 <= d2.location2 and d1.max1 >= d2.location2)
    where d1.id1 is null"
) %>% 
    arrange(id1, id2)

# id1 location1 min1 max1  id2 location2 min2 max2
# 1    A       123  100  199    A       123  100  199
# 2    B       247  200  299    E       221  212  221
# 3    C       335  300  399 <NA>        NA   NA   NA
# 4    D       454  400  499 <NA>        NA   NA   NA
# 5 <NA>        NA   NA   NA    F       522  500  599

CodePudding user response:

Here's a method that I thought was not the best but got the job done:

data1 <- data.frame(id1 = c("A","B","C","D"),                  
                    location1 = c(123,247,335,454),
                    min1 = c(100,200,300,400),
                    max1 = c(199,299,399,499))


data2 <- data.frame(id2 = c("A","E","F"),                  
                    location2 = c(123,221,522),
                    min2 = c(100,212,500),
                    max2 = c(199,221,599))

library(dplyr)    

aux_data <-
  data1 %>% 
  full_join(data2, by = character()) %>% 
  filter(
    # id's are a perfect match
    id1 == id2 |
      # id's are not a match but location2 is between min1 and max1
      location2 >= min1 & location2 <= max1 
  ) 

aux_id <- unique(c(aux_data$id1,aux_data$id2))

aux_data %>% 
  # no match, retained from d1
  bind_rows(
    data1 %>% anti_join(data2, by = c("id1" = "id2")) %>% filter(!(id1 %in% aux_id))
  ) %>% 
  bind_rows(
  # no match, retained from d2  
    data2 %>% anti_join(data1, by = c("id2" = "id1")) %>% filter(!(id2 %in% aux_id))
  )
  
   id1 location1 min1 max1  id2 location2 min2 max2
1    A       123  100  199    A       123  100  199
2    B       247  200  299    E       221  212  221
3    C       335  300  399 <NA>        NA   NA   NA
4    D       454  400  499 <NA>        NA   NA   NA
5 <NA>        NA   NA   NA    F       522  500  599
  • Related