Home > Software engineering >  Inner join data.frames using multiple arguments in R dplyr
Inner join data.frames using multiple arguments in R dplyr

Time:01-10

I have two data frames that look like this

library(dplyr)

#> Attaching package: 'dplyr'


df1 <- tibble(chrom=c(1,1,1,2,2,2),
              start=c(100,200,300,100,200,300),
              end=c(150,250,350,120,220,320))

df2 <- tibble(chrom=c(1,1,1,2,2,2),
              start2=c(100,50,280,100,10,200),
              end2=c(125,100,320,115,15,350))

df1 
#> # A tibble: 6 × 3
#>   chrom start   end
#>   <dbl> <dbl> <dbl>
#> 1     1   100   150
#> 2     1   200   250
#> 3     1   300   350
#> 4     2   100   120
#> 5     2   200   220
#> 6     2   300   320
df2
#> # A tibble: 6 × 3
#>   chrom start2  end2
#>   <dbl>  <dbl> <dbl>
#> 1     1    100   125
#> 2     1     50   100
#> 3     1    280   320
#> 4     2    100   115
#> 5     2     10    15
#> 6     2    200   350

Created on 2023-01-09 with reprex v2.0.2

When I am trying to join based on the following conditions I am taking a huge error any ideas why

inner_join(df2, df1, by = join_by(chrom, start< end2, end > start2))

The error

Error in `inner_join()`:
! Join columns in `x` must be present in the data.
✖ Problem with `start` and `end`.
Run `rlang::last_error()` to see where the error occurred.

This is the expected outcome

  chrom start end start2 end2
1:     1   100 150    100  125
2:     1    NA  NA     50  100
3:     1   300 350    280  320
4:     2   100 120    100  115
5:     2    NA  NA     10   15
6:     2   200 220    200  350
7:     2   300 320    200  350

CodePudding user response:

You list df2 first in the inner_join, its variables need to be listed on the LHS of the comparisons.

You can either swap df1/df2 or swap the order of the comparison variables (effectively the same given an inner join):

inner_join(df2, df1, by = join_by(chrom, end2 > start, start2 < end))
# # A tibble: 5 × 5
#   chrom start2  end2 start   end
#   <dbl>  <dbl> <dbl> <dbl> <dbl>
# 1     1    100   125   100   150
# 2     1    280   320   300   350
# 3     2    100   115   100   120
# 4     2    200   350   200   220
# 5     2    200   350   300   320

inner_join(df1, df2, by = join_by(chrom, start < end2, end > start2))
# # A tibble: 5 × 5
#   chrom start   end start2  end2
#   <dbl> <dbl> <dbl>  <dbl> <dbl>
# 1     1   100   150    100   125
# 2     1   300   350    280   320
# 3     2   100   120    100   115
# 4     2   200   220    200   350
# 5     2   300   320    200   350

Your expected output suggests a sided-join, where these two are effectively equivalent:

left_join(df2, df1, by = join_by(chrom, end2 > start, start2 < end))
# # A tibble: 7 × 5
#   chrom start2  end2 start   end
#   <dbl>  <dbl> <dbl> <dbl> <dbl>
# 1     1    100   125   100   150
# 2     1     50   100    NA    NA
# 3     1    280   320   300   350
# 4     2    100   115   100   120
# 5     2     10    15    NA    NA
# 6     2    200   350   200   220
# 7     2    200   350   300   320
right_join(df1, df2, by = join_by(chrom, start < end2, end > start2))

CodePudding user response:

Is this what you're looking for?

df3 <- inner_join(df2, df1, by = "chrom") %>%
filter(start< end2, end > start2)
  • Related