How to apply a function to grouped rows between 2 dataframes?

I have two data frames of genetic data and want to run a hypergeometric test between them (using the GeneOverlap package) for every phenotype that appears in both datasets. I'm trying to automate this and store the results for each phenotype in a new data frame, but I'm stuck on applying the function across all the phenotypes that the two data frames share.

My datasets look like this:

Dataset1:

Gene      Gene_count   Phenotype
Gene1          5       Phenotype1
Gene1          5       Phenotype2
Gene2          3       Phenotype1
Gene3         16       Phenotype6
Gene3.        16       Phenotype2
Gene3         16       Phenotype1

Dataset2:

Gene    Gene_count     Phenotype
Gene1         10       Phenotype1
Gene1         10       Phenotype2
Gene4         4        Phenotype1
Gene2         17       Phenotype6
Gene6         3        Phenotype2
Gene7         2        Phenotype1

At the moment I run one hypergeometric test at a time, like this:

dataset1_pheno1 <- dataset1  %>%
  filter(str_detect(Phenotype, 'Phenotype1'))

dataset2_pheno1 <- dataset2  %>%
  filter(str_detect(Phenotype, 'Phenotype1'))

go.obj <- newGeneOverlap(dataset1_pheno1$Gene, 
                         dataset2_pheno1$Gene,
                         genome.size=1871)
go.obj <- testGeneOverlap(go.obj)
go.obj

I want to repeat this test for every phenotype in the two datasets. So far I've tried using group_by() from dplyr and running the GeneOverlap functions inside it, but I haven't been able to get this working. What functions can I use to group both datasets by a column and then run a function on one group at a time?

Example input data:

library(GeneOverlap)
library(dplyr)
library(stringr)

dataset1 <- structure(list(Gene = c("Gene1", "Gene1", "Gene2", "Gene3", "Gene3.", 
"Gene3"), Gene_count = c(5L, 5L, 3L, 16L, 16L, 16L), Phenotype = c("Phenotype1", 
"Phenotype2", "Phenotype1", "Phenotype6", "Phenotype2", "Phenotype1"
)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
))


dataset2 <- structure(list(Gene = c("Gene1", "Gene1", "Gene4", "Gene2", "Gene6", 
"Gene7"), Gene_count = c(10L, 10L, 4L, 17L, 3L, 2L), Phenotype = c("Phenotype1", 
"Phenotype2", "Phenotype1", "Phenotype6", "Phenotype2", "Phenotype1"
)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
))

CodePudding user response:

You could split each dataset into a list by "Phenotype" and then use Map to run the test on each matching pair. Note that Map pairs list elements by position, not by name, so both datasets must contain the same set of unique phenotypes, in the same order. In other words, the two lists must have the same length and all(names(d1_split) == names(d2_split)) must be TRUE.

d1_split <- split(dataset1, dataset1$Phenotype)
d2_split <- split(dataset2, dataset2$Phenotype)

# this should be TRUE in order for Map to work correctly
all(names(d1_split) == names(d2_split))

tests <- Map(function(d1, d2) {
  go.obj <- newGeneOverlap(d1$Gene, d2$Gene, genome.size = 1871)
  return(testGeneOverlap(go.obj))
}, d1_split, d2_split)
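
If the two datasets do not share exactly the same phenotypes, one option is to restrict the loop to the phenotype names they have in common and look each group up by name. The sketch below is a possible variant of the approach above, not part of the original answer: it also collects one row per phenotype into a data frame, which is what the question asks for. The names `common` and `results` and the output column names are made up for illustration, the example data is rebuilt as plain data frames, and the accessors `getPval()`, `getOddsRatio()` and `getJaccard()` are assumed to be available from the installed version of GeneOverlap.

```r
library(GeneOverlap)

# Example inputs from the question, rebuilt as plain data frames
dataset1 <- data.frame(
  Gene = c("Gene1", "Gene1", "Gene2", "Gene3", "Gene3.", "Gene3"),
  Phenotype = c("Phenotype1", "Phenotype2", "Phenotype1",
                "Phenotype6", "Phenotype2", "Phenotype1")
)
dataset2 <- data.frame(
  Gene = c("Gene1", "Gene1", "Gene4", "Gene2", "Gene6", "Gene7"),
  Phenotype = c("Phenotype1", "Phenotype2", "Phenotype1",
                "Phenotype6", "Phenotype2", "Phenotype1")
)

# Split only the Gene column, giving one character vector per phenotype
d1_split <- split(dataset1$Gene, dataset1$Phenotype)
d2_split <- split(dataset2$Gene, dataset2$Phenotype)

# Keep only phenotypes present in both datasets (hypothetical helper name)
common <- intersect(names(d1_split), names(d2_split))

# Run one test per shared phenotype and stack the summaries row-wise
results <- do.call(rbind, lapply(common, function(p) {
  go.obj <- testGeneOverlap(
    newGeneOverlap(d1_split[[p]], d2_split[[p]], genome.size = 1871)
  )
  data.frame(
    Phenotype  = p,
    n_overlap  = length(intersect(d1_split[[p]], d2_split[[p]])),
    pval       = getPval(go.obj),
    odds_ratio = getOddsRatio(go.obj),
    jaccard    = getJaccard(go.obj)
  )
}))

results
```

Looking groups up by name means a phenotype that is missing from one dataset is skipped rather than silently paired with the wrong group. Whether skipping is the right behaviour depends on the analysis; reporting those phenotypes separately may be preferable.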