I am trying to full join several dataframes (d1, d2,...) based on several conditions: Either the ID has to match exactly beetween the two dataframes (id1 and id2) OR the location value of d2 has to be between the min and max columns of d1.
data1 <- data.frame(id1 = c("A","B","C","D"),
location1 = c(123,247,335,454),
min1 = c(100,200,300,400),
max1 = c(199,299,399,499))
data2 <- data.frame(id2 = c("A","E","F"),
location2 = c(123,221,522),
min2 = c(100,212,500),
max2 = c(199,221,599))
What I want is:
id1 location1 min1 max1 id2 location2 min2 max2
1 A 123 100 199 A 123 100 199
2 B 247 200 299 E 221 212 221
3 C 335 300 399 NA NA NA NA
4 D 454 400 499 NA NA NA NA
5 NA NA NA NA F 522 500 599
- row1: ids are a perfect match, look no further
- row2: ids are not a match but location2 is between min1 and max1,so this appears on same row
- row3/4: no match, retained from d1
- row5:nomatch, retained from d2
CodePudding user response:
The sqldf
package provides a possible solution.
The union
is required because sqldf()
does not support full outer joins. I also use dplyr
's arrange()
at the end because it orders the rows as per your desired output, whereas order by
in sqldf()
puts the missing values first.
library(sqldf)
library(dplyr)
sqldf(
"select *
from data1 d1 left join data2 d2
on d1.id1 = d2.id2 or (d1.id1 <> d2.id2 and d1.min1 <= d2.location2 and d1.max1 >= d2.location2)
union
select d1.*, d2.*
from data2 d2 left join data1 d1
on d1.id1 = d2.id2 or (d1.id1 <> d2.id2 and d1.min1 <= d2.location2 and d1.max1 >= d2.location2)
where d1.id1 is null"
) %>%
arrange(id1, id2)
# id1 location1 min1 max1 id2 location2 min2 max2
# 1 A 123 100 199 A 123 100 199
# 2 B 247 200 299 E 221 212 221
# 3 C 335 300 399 <NA> NA NA NA
# 4 D 454 400 499 <NA> NA NA NA
# 5 <NA> NA NA NA F 522 500 599
CodePudding user response:
Here's a method that I thought was not the best but got the job done:
data1 <- data.frame(id1 = c("A","B","C","D"),
location1 = c(123,247,335,454),
min1 = c(100,200,300,400),
max1 = c(199,299,399,499))
data2 <- data.frame(id2 = c("A","E","F"),
location2 = c(123,221,522),
min2 = c(100,212,500),
max2 = c(199,221,599))
library(dplyr)
aux_data <-
data1 %>%
full_join(data2, by = character()) %>%
filter(
# id's are a perfect match
id1 == id2 |
# id's are not a match but location2 is between min1 and max1
location2 >= min1 & location2 <= max1
)
aux_id <- unique(c(aux_data$id1,aux_data$id2))
aux_data %>%
# no match, retained from d1
bind_rows(
data1 %>% anti_join(data2, by = c("id1" = "id2")) %>% filter(!(id1 %in% aux_id))
) %>%
bind_rows(
# no match, retained from d2
data2 %>% anti_join(data1, by = c("id2" = "id1")) %>% filter(!(id2 %in% aux_id))
)
id1 location1 min1 max1 id2 location2 min2 max2
1 A 123 100 199 A 123 100 199
2 B 247 200 299 E 221 212 221
3 C 335 300 399 <NA> NA NA NA
4 D 454 400 499 <NA> NA NA NA
5 <NA> NA NA NA F 522 500 599