Merging data frames by selecting for correct value-CodePudding

I have a data frame called "ref" that contains information that allows mapping of gene entrez ID to the gene's start and end positions. I have another data frame "dna" where each row contains unique mutations from samples, which gives a genomic position. I am trying to assign each position given in "dna" to map to information on "ref" in order to assign entrez ID to each mutation. I have tried a for loop to match for the same chromosome, and then select for positions in "dna" that fall between the coordinates in "ref" though I have not been successful. The "dna" dataset is over 1 million rows, so I'm not sure a for loop is an efficient solution. Note that many positions will be mapped to the same entrez ID in my real dataset. "Final" is what I want to happen- which would just add a column for entrezID according to chromosome/position. TYIA!

ref = data.frame("EntrezID" = c(1, 10, 100, 1000), "Chromosome" = c("19", "8", "20", "18"), "txStarts" = c("58345182", "18391281", "44619518", "27950965"), "txEnds" = c("58353492", "18401215", "44651758", "28177130"))

dna = data.frame("Chromosome" = c("19", "8", "20", "18"), "Pos" = c("58345186", "18401213", "44619519", "27950966"),
             "Sample" = c("HCC1", "HCC2", "HCC1", "HCC3"))

final = data.frame("Chromosome" = c("19", "8", "20", "18"), "Pos" = c("58345186", "18401213", "44619519", "27950966"),
               "Sample" = c("HCC1", "HCC2", "HCC1", "HCC3"), "EntrezID" = c(1,10,100,1000))

CodePudding user response：

One option would be to use sqldf, which should also be efficient for a large dataframe.

library(tibble)
library(sqldf)

as_tibble(sqldf("select dna.*, ref.EntrezID from dna
                join ref on dna.Pos > ref.'txStarts' and 
                dna.Pos < ref.'txEnds'"))

Another option using fuzzy_join:

library(dplyr)
library(fuzzyjoin)

dna %>%
  fuzzy_join(ref %>% select(-Chromosome), by = c("Pos" = "txStarts", "Pos" = "txEnds"),
             match_fun = list(`>`, `<`)) %>%
  select(names(dna), EntrezID)

Output

  Chromosome      Pos Sample EntrezID
1         19 58345186   HCC1        1
2          8 18401213   HCC2       10
3         20 44619519   HCC1      100
4         18 27950966   HCC3     1000

CodePudding user response：

In base apply could be used to find matches per row for Chromosome and test if Pos is in the range of txStarts txEnds.

dna$EntrezID <- apply(dna[c("Chromosome", "Pos")], 1, \(x)
  ref$EntrezID[ref$Chromosome == x["Chromosome"] &
    x["Pos"] >= ref$txStarts & x["Pos"] <= ref$txEnds][1])
dna
#  Chromosome      Pos Sample EntrezID
#1         19 58345186   HCC1        1
#2          8 18401213   HCC2       10
#3         20 44619519   HCC1      100
#4         18 27950966   HCC3     1000

CodePudding user response：

If the 'Pos', 'txStarts', 'txEnds' are numeric, then we can use non-equi join

library(data.table)
setDT(dna)[ref, EntrezID := i.EntrezID, 
    on = .(Chromosome, Pos > txStarts, Pos <txEnds)]

-output

> dna
   Chromosome      Pos Sample EntrezID
       <char>    <num> <char>    <num>
1:         19 58345186   HCC1        1
2:          8 18401213   HCC2       10
3:         20 44619519   HCC1      100
4:         18 27950966   HCC3     1000

data

dna <- type.convert(dna, as.is = TRUE)
ref <- type.convert(ref, as.is = TRUE)