Taking means of duplicated IDs-CodePudding

I have a dataset "rna", which is thousands of rows that I have simplified below. There are rows that contain both the same "Sample" and "Ensembl", but have different expression values. I need a way to take the mean of the expression values for rows that have both the same sample ID and ensembl. Rows that contain unique combinations of "Sample" and "Ensembl" should not have expression values changed. I've tried to write a for loop with selecting for duplicated rows, but I can't figure out how to pair them in order to take the Expression means. "org-rna" matrix shows what I want the result to be. TYIA

rna = data.frame("Sample" = c(1, 1, 1, 2, 2, 3), "Ensembl" =  c("ENSG00000174010", "ENSG00000174010", "ENSG00000192738", "ENSG00000147183", "ENSG00000147183",
    "ENSG00000231267"), "Expression" = c(1, 2, 3.5, 1, 6, 8))


org_rna = data.frame("Sample" = c(1, 1, 2,3), "Ensembl" = c("ENSG00000174010", "ENSG00000192738", "ENSG00000147183", 
           "ENSG00000231267"), "Expression" =c(1.5, 3.5, 3.5, 8))

CodePudding user response：

If you are open to a tidyverse solution you could use

library(dplyr)

rna %>% 
  as_tibble() %>% 
  group_by(Sample, Ensembl) %>% 
  summarise(Expression = mean(as.numeric(Expression)), .groups = "drop")

which returns

# A tibble: 4 x 3
  Sample Ensembl         Expression
  <chr>  <chr>                <dbl>
1 1      ENSG00000174010        1.5
2 1      ENSG00000192738        3.5
3 2      ENSG00000147183        3.5
4 3      ENSG00000231267        8

Note: I converted your matrix into a tibble/data.frame which is easier to work on.

You need to group by Sample and Ensemble and calculate the mean.

CodePudding user response：

Or could use base R

aggregate(Expression ~ ., type.convert(rna, as.is = TRUE), mean)