Shortcut to calculate the mean of columns in a data frame

I have the following data frame:

> Gene <- c("1","2","3","4","5","6")
> A1.1 <- c(1,1,2,4,3,5)
> A1.2 <- c(1,2,3,4,5,6)
> B1.1 <- c(2,2,3,5,5,5)
> B1.2 <- c(1,2,3,5,5,5)
> A2.1 <- c(3,2,5,6,6,6)
> A2.2 <- c(1,1,2,2,4,6)
> B2.1 <- c(2,1,4,5,7,4)
> B2.2 <- c(1,3,4,5,2,3)
> df <- data.frame(Gene,A1.1,A1.2,B1.1,B1.2,A2.1,A2.2,B2.1,B2.2)
> df
  Gene A1.1 A1.2 B1.1 B1.2 A2.1 A2.2 B2.1 B2.2
1    1    1    1    2    1    3    1    2    1
2    2    1    2    2    2    2    1    1    3
3    3    2    3    3    3    5    2    4    4
4    4    4    4    5    5    6    2    5    5
5    5    3    5    5    5    6    4    7    2
6    6    5    6    5    5    6    6    4    3

For each gene (row), I wish to calculate the average across the samples (columns) that share the same letter/number prefix.

That is, for each gene (1-6), calculate the average of both A1 samples, both A2 samples, both B1 samples, and both B2 samples.

I know I can do this the long way using apply(). For example:

> A1_df <- data.frame(df$A1.1, df$A1.2)
> A1 <- apply(A1_df, 1, mean)
> A1
[1] 1.0 1.5 2.5 4.0 4.0 5.5

But is there a shortcut way of doing this using sapply() such that I end up with a new data frame where the columns are now "A1", "A2", "B1", "B2"?

Let me know if anything is unclear

Thanks

Answer:

Here, we may use split.default to split the numeric columns into a list of data frames, grouping them by their column names with the . and the digits following it removed. We then loop over the list with sapply and get the mean of each group with rowMeans

sapply(split.default(df[-1], sub("\\.\\d+$", "", names(df)[-1])), rowMeans)

-output

   A1  A2  B1  B2
1 1.0 2.0 1.5 1.5
2 1.5 1.5 2.0 2.0
3 2.5 3.5 3.0 4.0
4 4.0 4.0 5.0 5.0
5 4.0 5.0 5.0 4.5
6 5.5 6.0 5.0 3.5
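
To make the one-liner easier to follow, here is a sketch of the same idea broken into steps. The intermediate names grp, parts, means and res are only illustrative. The last step (binding Gene back on with cbind) is an addition that is not part of the one-liner: sapply returns a matrix here, and the question asked for a data frame.

```r
# Example data, same values as in the question (Gene stored as integers)
df <- data.frame(
  Gene = 1:6,
  A1.1 = c(1, 1, 2, 4, 3, 5), A1.2 = c(1, 2, 3, 4, 5, 6),
  B1.1 = c(2, 2, 3, 5, 5, 5), B1.2 = c(1, 2, 3, 5, 5, 5),
  A2.1 = c(3, 2, 5, 6, 6, 6), A2.2 = c(1, 1, 2, 2, 4, 6),
  B2.1 = c(2, 1, 4, 5, 7, 4), B2.2 = c(1, 3, 4, 5, 2, 3)
)

# Step 1: strip the trailing ".<digits>" to get one group label per column
grp <- sub("\\.\\d+$", "", names(df)[-1])
grp
# "A1" "A1" "B1" "B1" "A2" "A2" "B2" "B2"

# Step 2: split the columns (not the rows) by label;
# each list element is a data frame holding the columns of one group
parts <- split.default(df[-1], grp)
names(parts)
# "A1" "A2" "B1" "B2"

# Step 3: row means within each group; sapply() simplifies to a matrix
means <- sapply(parts, rowMeans)

# Step 4 (assumption: a data frame with Gene is wanted): bind Gene back on
res <- cbind(df["Gene"], means)
res
```

Note that split.default orders the groups by sorted label (A1, A2, B1, B2), not by their order of appearance in df.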

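Or reshape to 'long' format with pivot_longer and take the mean by group. Here, names_pattern uses a capture group, (.*), to match the characters before the . and the digits in each column name. Because names_to is ".value", the captured text (A1, B1, A2, B2) becomes the column names in the long format, and those columns are then averaged per Gene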

library(dplyr)
library(tidyr)
df %>% 
   pivot_longer(cols = -Gene, names_to = ".value", 
      names_pattern = "(.*)\\.\\d+") %>%
   group_by(Gene) %>%
   summarise(across(everything(), mean))
# A tibble: 6 × 5
   Gene    A1    B1    A2    B2
  <int> <dbl> <dbl> <dbl> <dbl>
1     1   1     1.5   2     1.5
2     2   1.5   2     1.5   2  
3     3   2.5   3     3.5   4  
4     4   4     5     4     5  
5     5   4     5     5     4.5
6     6   5.5   5     6     3.5

data

df <- structure(list(Gene = 1:6, A1.1 = c(1L, 1L, 2L, 4L, 3L, 5L), 
    A1.2 = 1:6, B1.1 = c(2L, 2L, 3L, 5L, 5L, 5L), B1.2 = c(1L, 
    2L, 3L, 5L, 5L, 5L), A2.1 = c(3L, 2L, 5L, 6L, 6L, 6L), A2.2 = c(1L, 
    1L, 2L, 2L, 4L, 6L), B2.1 = c(2L, 1L, 4L, 5L, 7L, 4L), B2.2 = c(1L, 
    3L, 4L, 5L, 2L, 3L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))