Find matching elements based on multiple arguments in R

I have a large data frame that looks like this. I want to find which genes match the others based on an overlap between the start and end positions.

library(tidyverse)

data <- data.frame(group=c(1,1,1,2,2,2),
                     genes=c("A","B","C","D","E","F"), 
                     start=c(1000,2000,3000,800,400,2000),
                     end=c(1500,2500,3500,1200,500,10000))

data
#>   group genes start   end
#> 1     1     A  1000  1500
#> 2     1     B  2000  2500
#> 3     1     C  3000  3500
#> 4     2     D   800  1200
#> 5     2     E   400   500
#> 6     2     F  2000 10000

Created on 2022-12-05 with reprex v2.0.2

I want something like this.

data
#>   group genes start   end   match
#> 1     1     A  1000  1500    A-D
#> 2     1     B  2000  2500    B-F
#> 3     1     C  3000  3500    C-F
#> 4     2     D   800  1200    A-D
#> 5     2     E   400   500    NA
#> 6     2     F  2000 10000    F-C-B

I am a bit lost on how to start. Any help is appreciated.

CodePudding user response：

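With the development version of dplyr (join_by() and overlaps() later shipped in dplyr 1.1.0), we can join the data to itself on overlapping ranges: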

library(dplyr)
library(stringr)
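# Self-join: keep every pair of rows whose [start, end] ranges overlap
by <- join_by(overlaps(x$start, x$end, y$start, y$end))
full_join(data, data, by) %>% 
  group_by(genes = genes.x) %>% 
  # every gene overlaps itself, so a single row means no other match
  summarise(match = if (n() == 1) NA_character_ else 
      str_c(genes.y, collapse = '-')) %>%
  left_join(data, ., by = "genes")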

-output

  group genes start   end match
1     1     A  1000  1500   A-D
2     1     B  2000  2500   B-F
3     1     C  3000  3500   C-F
4     2     D   800  1200   A-D
5     2     E   400   500  <NA>
6     2     F  2000 10000 B-C-F
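
If a dplyr version with join_by() is not available, the same closed-interval overlap test can be sketched in base R. This is only an illustration of the condition that overlaps() applies (start of one range <= end of the other, and end >= start), not a replacement for the join above. Note that outer() builds an n x n matrix, so it is an assumption here that the data is small enough for that to fit in memory.

```r
# Same toy data as in the question
data <- data.frame(group = c(1, 1, 1, 2, 2, 2),
                   genes = c("A", "B", "C", "D", "E", "F"),
                   start = c(1000, 2000, 3000, 800, 400, 2000),
                   end   = c(1500, 2500, 3500, 1200, 500, 10000))

# ov[i, j] is TRUE when gene i overlaps gene j:
# start[i] <= end[j] and end[i] >= start[j] (the diagonal is always TRUE)
ov <- outer(data$start, data$end, `<=`) & outer(data$end, data$start, `>=`)

# Collapse the overlapping gene names per row;
# a gene that only overlaps itself gets NA
data$match <- apply(ov, 1, function(hit) {
  if (sum(hit) == 1) NA_character_ else paste(data$genes[hit], collapse = "-")
})

data
```

On the toy data this gives the same match column as the dplyr output above.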

CodePudding user response：

To find which genes match each other based on an overlap between the start and end positions, you can use the fuzzyjoin package in R. This package provides tools for joining data based on fuzzy matching, which allows for inexact matching between data points.

First, you will need to install the fuzzyjoin package if you don't already have it installed:

install.packages("fuzzyjoin")

Once the package is installed, you can use fuzzy_left_join() to join data to itself on overlapping ranges. The function takes two data frames (here both are data, since each gene is compared with every other gene), a by argument naming the column pairs to compare, and a match_fun argument giving one comparison function per pair. Two ranges overlap when the first starts no later than the second ends and ends no earlier than the second starts, so start is compared with end using `<=`, and end is compared with start using `>=`. The group column is left out of by because the desired matches cross groups (A in group 1 matches D in group 2).

Here is an example that joins on the overlap and then collapses the matching genes into a single column:

library(tidyverse)
library(fuzzyjoin)

data %>%
  # x$start <= y$end and x$end >= y$start, i.e. the two ranges overlap
  fuzzy_left_join(data,
                  by = c("start" = "end", "end" = "start"),
                  match_fun = list(`<=`, `>=`)) %>%
  group_by(genes = genes.x) %>%
  # every gene overlaps itself, so a single row means no other match
  summarise(match = if (n() == 1) NA_character_ else
      paste(sort(genes.y), collapse = "-")) %>%
  left_join(data, ., by = "genes")

The join itself returns one row per overlapping pair, with the shared column names suffixed .x and .y. The summarise() step reduces that to one match string per gene, which is then joined back onto the original data.