I have a dataset "rna", which is thousands of rows that I have simplified below. There are rows that contain both the same "Sample" and "Ensembl", but have different expression values. I need a way to take the mean of the expression values for rows that have both the same sample ID and ensembl. Rows that contain unique combinations of "Sample" and "Ensembl" should not have expression values changed. I've tried to write a for loop with selecting for duplicated rows, but I can't figure out how to pair them in order to take the Expression means. "org-rna" matrix shows what I want the result to be. TYIA
rna = data.frame("Sample" = c(1, 1, 1, 2, 2, 3), "Ensembl" = c("ENSG00000174010", "ENSG00000174010", "ENSG00000192738", "ENSG00000147183", "ENSG00000147183",
"ENSG00000231267"), "Expression" = c(1, 2, 3.5, 1, 6, 8))
org_rna = data.frame("Sample" = c(1, 1, 2,3), "Ensembl" = c("ENSG00000174010", "ENSG00000192738", "ENSG00000147183",
"ENSG00000231267"), "Expression" =c(1.5, 3.5, 3.5, 8))
CodePudding user response:
If you are open to a tidyverse
solution you could use
library(dplyr)
rna %>%
as_tibble() %>%
group_by(Sample, Ensembl) %>%
summarise(Expression = mean(as.numeric(Expression)), .groups = "drop")
which returns
# A tibble: 4 x 3
Sample Ensembl Expression
<chr> <chr> <dbl>
1 1 ENSG00000174010 1.5
2 1 ENSG00000192738 3.5
3 2 ENSG00000147183 3.5
4 3 ENSG00000231267 8
Note: I converted your matrix into a tibble
/data.frame
which is easier to work on.
You need to group by Sample
and Ensemble
and calculate the mean
.
CodePudding user response:
Or could use base R
aggregate(Expression ~ ., type.convert(rna, as.is = TRUE), mean)