I'm trying to process a long-format data frame in R. I'm looking for rows whose value in the tau column is 0.7 or greater. For each such row, I then find all the other rows that have the same value in the GeneID column and the same value in the species column, but different values in the tissue column. Among those rows I check which one has the highest log2Expression, and then put that row's tissue in the biased column for every row with the same GeneID and species. I have a working for loop for it, but it's slow and rather ugly at the moment:
long_tau$biased <- 'general'
for (gRow in 1:nrow(long_tau)) {
  print(gRow)
  if (!is.nan(long_tau$tau[gRow])) {
    if (long_tau$tau[gRow] >= 0.7) {
      tmpGenes <- long_tau %>%
        filter_all(any_vars(. %in% c(long_tau$GeneID[gRow]))) %>%
        filter_all(any_vars(. %in% c(long_tau$species[gRow])))
      long_tau$biased[gRow] <- tmpGenes[which.max(tmpGenes$log2Expression), ]$tissue
    }
  }
}
I was wondering what I could do to make it more efficient. I was thinking I could try setting the biased column all at once for all the filtered rows that end up in the tmpGenes data frame, and then skip any row whose biased value is no longer 'general'. I don't know how I would do that, though. Other ideas for making this more efficient are welcome.
The data looks like this:
GeneID | tau | species | tissue | log2Expression | biased |
---|---|---|---|---|---|
Solyc01g005000.3 | 0.7000207 | lyc | styungerm | 5.40986856 | styungerm |
Each time I make tmpGenes it has three rows, one for each tissue.
Thanks for any help. Added in some rows here using dput() as requested.
structure(list(GeneID = c("Solyc01g005000.3", "Solyc01g005010.4",
"Solyc01g005020.3", "Solyc01g005030.4", "Solyc01g005040.3", "Solyc01g005050.4",
"Solyc01g005060.3"), tau = c(0.700020714228337, 0.519089831890165,
0.527472673446906, 0.513496977771781, NaN, 1, 1), species = c("lyc",
"lyc", "lyc", "lyc", "lyc", "lyc", "lyc"), tissue = c("styungerm",
"styungerm", "styungerm", "styungerm", "styungerm", "styungerm",
"styungerm"), log2Expression = c(5.40986855973033, 3.79990010472802,
5.94750789262394, 5.27701171052278, 0, 0, 0), specific = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), PME = c(FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE), PMEI = c(FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE), biased = c("styungerm", "general",
"general", "general", "general", "pollen", "leaf")), row.names = 6:12, class = "data.frame")
Rows in tmpGenes:
structure(list(GeneID = c("Solyc07g005715.1", "Solyc07g005715.1",
"Solyc07g005715.1"), tau = c(1, 1, 1), species = c("lyc", "lyc",
"lyc"), tissue = c("styungerm", "pollen", "leaf"), log2Expression = c(0,
0.574076953583166, 0), specific = c(FALSE, FALSE, FALSE), PME = c(FALSE,
FALSE, FALSE), PMEI = c(FALSE, FALSE, FALSE), biased = c("general",
"general", "general")), row.names = c(NA, -3L), class = "data.frame")
Answer:
In base R, consider `ave` to calculate aggregates by one or more groups. Note that the code below uses the lambda-like shorthand `\(x)`, introduced in R 4.1.0; use `function(x)` in earlier versions.
long_tau <- within(
  long_tau, {
    # CALCULATE MAX log2Expression BY GeneID AND species
    max_log2exp <- ave(log2Expression, GeneID, species, FUN=\(x) max(x, na.rm=TRUE))
    # CONDITIONALLY ASSIGN NEW biased COLUMN
    biased <- ifelse(
      log2Expression == max_log2exp & tau >= 0.7 & !is.na(tau), tissue, "general"
    )
    # REMOVE HELPER CALCULATION
    rm(max_log2exp)
  }
)
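One thing to watch: the `ifelse` above writes the tissue only on the row that holds the group maximum, while the loop in the question labels every row of a qualifying GeneID/species group with the winning tissue (in the sample data, the styungerm row of Solyc01g005050.4 is marked pollen). If that is the behavior you need, a variant along these lines might work. Treat it as a sketch under assumptions: tau is taken to be identical across the tissues of a gene, log2Expression is taken to have no missing values, and the data frame `toy` with its gene names and values is made up for illustration (`top_row` and `qualifies` are likewise just helper names).

```r
# Toy data in the same long format as the question (illustrative values only)
toy <- data.frame(
  GeneID = rep(c("geneA", "geneB", "geneC"), each = 3),
  species = "lyc",
  tissue = rep(c("styungerm", "pollen", "leaf"), times = 3),
  tau = rep(c(1, 0.5, NaN), each = 3),
  log2Expression = c(0, 0.57, 0, 5.4, 3.8, 5.9, 0, 0, 0),
  stringsAsFactors = FALSE
)

# For every row, the index of the row with the highest log2Expression
# in its GeneID/species group (which.max keeps the first row on ties)
top_row <- ave(
  seq_len(nrow(toy)), toy$GeneID, toy$species,
  FUN = function(i) i[which.max(toy$log2Expression[i])]
)

# Groups qualify when tau is present and at least 0.7
qualifies <- !is.na(toy$tau) & toy$tau >= 0.7

# Give every row of a qualifying group the tissue of its top row
toy$biased <- ifelse(qualifies, toy$tissue[top_row], "general")
```

In `toy`, all three geneA rows end up as pollen, while the geneB rows (tau below 0.7) and the geneC rows (tau is NaN) stay general. Because the grouping is done once by `ave` rather than once per row, there is no need to filter the whole data frame inside a loop.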