I have two data frames of genetic data, and I want to run a hypergeometric test between them (using the GeneOverlap package) for every phenotype across the two datasets. I'm trying to automate this and store the results for each phenotype in a new data frame, but I'm stuck on applying the function to all the phenotypes that appear in both data frames.
My datasets look like this:
Dataset1:
Gene Gene_count Phenotype
Gene1 5 Phenotype1
Gene1 5 Phenotype2
Gene2 3 Phenotype1
Gene3 16 Phenotype6
Gene3. 16 Phenotype2
Gene3 16 Phenotype1
Dataset2:
Gene Gene_count Phenotype
Gene1 10 Phenotype1
Gene1 10 Phenotype2
Gene4 4 Phenotype1
Gene2 17 Phenotype6
Gene6 3 Phenotype2
Gene7 2 Phenotype1
At the moment I run one hypergeometric test at a time, looking like this:
dataset1_pheno1 <- dataset1 %>%
  filter(str_detect(Phenotype, 'Phenotype1'))
dataset2_pheno1 <- dataset2 %>%
  filter(str_detect(Phenotype, 'Phenotype1'))
go.obj <- newGeneOverlap(dataset1_pheno1$Gene,
                         dataset2_pheno1$Gene,
                         genome.size=1871)
go.obj <- testGeneOverlap(go.obj)
go.obj
I want to repeat this test for every phenotype in the two datasets. So far I've been trying to use group_by() from dplyr and run the GeneOverlap functions inside it, but I haven't been able to get this working. What functions can I use to group both datasets by a column and then run a function on one group at a time?
Example input data:
library(GeneOverlap)
library(dplyr)
library(stringr)
dataset1 <- structure(list(Gene = c("Gene1", "Gene1", "Gene2", "Gene3", "Gene3.",
"Gene3"), Gene_count = c(5L, 5L, 3L, 16L, 16L, 16L), Phenotype = c("Phenotype1",
"Phenotype2", "Phenotype1", "Phenotype6", "Phenotype2", "Phenotype1"
)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
))
dataset2 <- structure(list(Gene = c("Gene1", "Gene1", "Gene4", "Gene2", "Gene6",
"Gene7"), Gene_count = c(10L, 10L, 4L, 17L, 3L, 2L), Phenotype = c("Phenotype1",
"Phenotype2", "Phenotype1", "Phenotype6", "Phenotype2", "Phenotype1"
)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
))
Answer:
You could split each dataset into a list by Phenotype and then use Map to run the test on each pair of subsets. Note that Map pairs the list elements by position, so both datasets must contain the same unique phenotypes, in the same order. In other words, all(names(d1_split) == names(d2_split)) must be TRUE.
d1_split <- split(dataset1, dataset1$Phenotype)
d2_split <- split(dataset2, dataset2$Phenotype)
# this should be TRUE in order for Map to work correctly
all(names(d1_split) == names(d2_split))
tests <- Map(function(d1, d2) {
  go.obj <- newGeneOverlap(d1$Gene, d2$Gene, genome.size = 1871)
  return(testGeneOverlap(go.obj))
}, d1_split, d2_split)
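If the two datasets do not share exactly the same phenotypes, one possible workaround is to restrict both lists to the phenotypes they have in common before calling Map, and then collect the results into a data frame as the question asks. The following is only a sketch under some assumptions: it rebuilds the example data as plain data frames (dropping Gene_count, which the test does not use), the column names in results (n_overlap, p_value, odds_ratio) are my own choices, and it relies on the GeneOverlap accessors getIntersection(), getPval() and getOddsRatio() as I understand them from the package documentation, so check them against your installed version.

```r
library(GeneOverlap)

# Example data from the question, rebuilt as plain data frames
# (Gene_count is left out because the overlap test does not use it)
dataset1 <- data.frame(
  Gene = c("Gene1", "Gene1", "Gene2", "Gene3", "Gene3.", "Gene3"),
  Phenotype = c("Phenotype1", "Phenotype2", "Phenotype1",
                "Phenotype6", "Phenotype2", "Phenotype1")
)
dataset2 <- data.frame(
  Gene = c("Gene1", "Gene1", "Gene4", "Gene2", "Gene6", "Gene7"),
  Phenotype = c("Phenotype1", "Phenotype2", "Phenotype1",
                "Phenotype6", "Phenotype2", "Phenotype1")
)

# Split the gene names by phenotype: one character vector per phenotype
d1_split <- split(dataset1$Gene, dataset1$Phenotype)
d2_split <- split(dataset2$Gene, dataset2$Phenotype)

# Keep only phenotypes present in both datasets; indexing both lists
# by the same names guarantees that Map pairs matching phenotypes
common <- intersect(names(d1_split), names(d2_split))

tests <- Map(function(g1, g2) {
  testGeneOverlap(newGeneOverlap(g1, g2, genome.size = 1871))
}, d1_split[common], d2_split[common])

# One row per phenotype, pulled out of the tested GeneOverlap objects
results <- data.frame(
  Phenotype  = common,
  n_overlap  = vapply(tests, function(x) length(getIntersection(x)), integer(1)),
  p_value    = vapply(tests, getPval, numeric(1)),
  odds_ratio = vapply(tests, getOddsRatio, numeric(1)),
  row.names  = NULL
)
results
```

Phenotypes that occur in only one dataset are silently dropped by intersect(), so compare common against the names of each split list if you need to know which ones were skipped.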