I have a data frame called "ref" that maps each gene's Entrez ID to the gene's chromosome and its start and end positions. I have another data frame, "dna", where each row is a unique mutation from a sample and gives a genomic position. I am trying to map each position in "dna" to the matching interval in "ref" in order to assign an Entrez ID to each mutation. I tried a for loop that matches on chromosome and then selects the positions in "dna" that fall between the coordinates in "ref", but I have not been successful. The "dna" dataset has over 1 million rows, so I'm not sure a for loop is an efficient solution. Note that many positions will map to the same Entrez ID in my real dataset. "final" is what I want to end up with: "dna" with an added EntrezID column assigned by chromosome and position. TYIA!
ref = data.frame("EntrezID" = c(1, 10, 100, 1000), "Chromosome" = c("19", "8", "20", "18"), "txStarts" = c("58345182", "18391281", "44619518", "27950965"), "txEnds" = c("58353492", "18401215", "44651758", "28177130"))
dna = data.frame("Chromosome" = c("19", "8", "20", "18"), "Pos" = c("58345186", "18401213", "44619519", "27950966"),
"Sample" = c("HCC1", "HCC2", "HCC1", "HCC3"))
final = data.frame("Chromosome" = c("19", "8", "20", "18"), "Pos" = c("58345186", "18401213", "44619519", "27950966"),
"Sample" = c("HCC1", "HCC2", "HCC1", "HCC3"), "EntrezID" = c(1,10,100,1000))
CodePudding user response:
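One option is to use sqldf, which should also be efficient for a large data frame. The join matches on Chromosome as well as on the position range, and it assumes Pos, txStarts and txEnds are numeric (as posted they are character, so convert them first).
library(tibble)
library(sqldf)
# Join each mutation to the gene on the same chromosome whose interval contains it
as_tibble(sqldf("select dna.*, ref.EntrezID from dna
join ref on dna.Chromosome = ref.Chromosome and
dna.Pos > ref.txStarts and
dna.Pos < ref.txEnds"))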
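Another option is fuzzy_join from the fuzzyjoin package, again matching on chromosome and on the position range (with numeric position columns):
library(dplyr)
library(fuzzyjoin)
dna %>%
  # ref's Chromosome is renamed (ref_chr is an arbitrary name) so the result keeps dna's column names
  fuzzy_join(ref %>% rename(ref_chr = Chromosome),
             by = c("Chromosome" = "ref_chr", "Pos" = "txStarts", "Pos" = "txEnds"),
             match_fun = list(`==`, `>`, `<`)) %>%
  select(all_of(names(dna)), EntrezID)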
Output
  Chromosome      Pos Sample EntrezID
1         19 58345186   HCC1        1
2          8 18401213   HCC2       10
3         20 44619519   HCC1      100
4         18 27950966   HCC3     1000
CodePudding user response:
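In base R, apply can be used row by row to find the ref entry with the same Chromosome and test whether Pos lies in the range txStarts to txEnds. The trailing [1] keeps the first match if several genes overlap.
dna$EntrezID <- apply(dna[c("Chromosome", "Pos")], 1, \(x) {
  pos <- as.numeric(x["Pos"])  # apply() can pass the row as character, so convert before comparing
  ref$EntrezID[ref$Chromosome == x["Chromosome"] &
    pos >= as.numeric(ref$txStarts) & pos <= as.numeric(ref$txEnds)][1]
})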
dna
#  Chromosome      Pos Sample EntrezID
#1         19 58345186   HCC1        1
#2          8 18401213   HCC2       10
#3         20 44619519   HCC1      100
#4         18 27950966   HCC3     1000
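A vectorised base R alternative to the row-wise lookup can be sketched with merge followed by a filter. This is only a sketch: it assumes the position columns are stored as numbers (the toy data below is rebuilt that way), and `cand` and `hits` are illustrative names. Because merge pairs every mutation with every gene on the same chromosome before filtering, it may need a lot of memory on a data set with over a million rows.

```r
# Toy data mirroring the question, with positions stored as numbers (an assumption)
ref <- data.frame(EntrezID = c(1, 10, 100, 1000),
                  Chromosome = c("19", "8", "20", "18"),
                  txStarts = c(58345182, 18391281, 44619518, 27950965),
                  txEnds = c(58353492, 18401215, 44651758, 28177130))
dna <- data.frame(Chromosome = c("19", "8", "20", "18"),
                  Pos = c(58345186, 18401213, 44619519, 27950966),
                  Sample = c("HCC1", "HCC2", "HCC1", "HCC3"))

# Pair every mutation with every gene on the same chromosome
cand <- merge(dna, ref, by = "Chromosome")

# Keep only the pairs where the position lies inside the gene's interval
hits <- cand[cand$Pos >= cand$txStarts & cand$Pos <= cand$txEnds,
             c(names(dna), "EntrezID")]
hits
```

Note that merge sorts the result by Chromosome, so the row order differs from dna, and mutations that fall in no gene are dropped rather than given NA.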
CodePudding user response:
If 'Pos', 'txStarts' and 'txEnds' are numeric, we can use a non-equi join from data.table, which adds EntrezID to dna by reference:
library(data.table)
setDT(dna)[ref, EntrezID := i.EntrezID,
on = .(Chromosome, Pos > txStarts, Pos < txEnds)]
Output
> dna
   Chromosome      Pos Sample EntrezID
       <char>    <num> <char>    <num>
1:         19 58345186   HCC1        1
2:          8 18401213   HCC2       10
3:         20 44619519   HCC1      100
4:         18 27950966   HCC3     1000
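In a real data set some mutations will probably fall outside every gene. A small sketch of how the same update join treats such rows, using made-up toy data (the third mutation, at position 100, is invented for illustration and lies in no interval):

```r
library(data.table)

# Toy data: two genes, three mutations; the last mutation matches no gene
ref <- data.table(EntrezID = c(1, 10),
                  Chromosome = c("19", "8"),
                  txStarts = c(58345182, 18391281),
                  txEnds = c(58353492, 18401215))
dna <- data.table(Chromosome = c("19", "8", "8"),
                  Pos = c(58345186, 18401213, 100),
                  Sample = c("HCC1", "HCC2", "HCC3"))

# Same non-equi update join as above; rows of dna without a match are left as NA
dna[ref, EntrezID := i.EntrezID,
    on = .(Chromosome, Pos > txStarts, Pos < txEnds)]
dna
```

The unmatched mutation stays in dna with EntrezID set to NA, so no rows are lost.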
Data (run this before the join so that the position columns are numeric rather than character)
dna <- type.convert(dna, as.is = TRUE)
ref <- type.convert(ref, as.is = TRUE)
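A short illustration of why the conversion matters: positions stored as character are compared as text, which can disagree with numeric order when the numbers have different lengths. The values below are made up for the example, and `pos_chr` is an illustrative name.

```r
# Two positions stored as character, as in the question's original data frames
pos_chr <- c("9000", "58345186")

# Text comparison goes character by character, so "9..." sorts after "5..."
pos_chr[1] > pos_chr[2]                          # TRUE

# After conversion the comparison is numeric
as.numeric(pos_chr[1]) > as.numeric(pos_chr[2])  # FALSE
```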