Home > Software engineering >  Calculate summary statistics for each row of a matrix based on columns grouped by column names
Calculate summary statistics for each row of a matrix based on columns grouped by column names

Time:10-25

I have a matrix with column names showing the grouping information (group A and B in the example below). What I want is for each row, calculate a given summary statistics like mean using columns that belong to each group. An example code can be found below:

#input data
data = matrix(1:100, 10, 10, byrow = T)
colnames(data) = rep(c("A", "B"), each = 5)
data

#calculation
result = t(apply(data, 1, function(x, label){aggregate(x, by=list(label), FUN=mean)$x}, label = colnames(data)))
result

There are different ways of doing this (like a for loop or using apply like the implementation above). However, this implementation is not very efficient when I use it on matrices with hundreds of rows and at least 10k columns. I was wondering if there is any faster and more efficient way of implementing this? My input data format is matrix so any method that is implemented based on other data types need to also consider the time of data type converting.

CodePudding user response:

This is the best scenario to use tapply:

tapply(t(data), list(col(data), array(colnames(data), dim(t(data)))), mean)
    A  B
1   3  8
2  13 18
3  23 28
4  33 38
5  43 48
6  53 58
7  63 68
8  73 78
9  83 88
10 93 98

tapply(data, list(t(colnames(data))[rep(1,nrow(data)), ], row(data)), mean)
  1  2  3  4  5  6  7  8  9 10
A 3 13 23 33 43 53 63 73 83 93
B 8 18 28 38 48 58 68 78 88 98

 tapply(t(data), interaction(colnames(data), col(data)), mean)
 A.1  B.1  A.2  B.2  A.3  B.3  A.4  B.4  A.5  B.5  A.6  B.6  A.7  B.7  A.8  B.8  A.9  B.9 A.10 B.10 
   3    8   13   18   23   28   33   38   43   48   53   58   63   68   73   78   83   88   93   98 

More base R solutions:

sapply(split.default(data.frame(data), colnames(data)), rowMeans)
       A  B
 [1,]  3  8
 [2,] 13 18
 [3,] 23 28
 [4,] 33 38
 [5,] 43 48
 [6,] 53 58
 [7,] 63 68
 [8,] 73 78
 [9,] 83 88
[10,] 93 98

data.frame(data) |>
  reshape(split(1:ncol(data), colnames(data)),  dir = 'long') |>
  (\(x)aggregate(.~id, x, mean))()

   id time  A  B
1   1    3  3  8
2   2    3 13 18
3   3    3 23 28
4   4    3 33 38
5   5    3 43 48
6   6    3 53 58
7   7    3 63 68
8   8    3 73 78
9   9    3 83 88
10 10    3 93 98

CodePudding user response:

We may use aggregate in base R

aggregate(Freq ~ ., as.data.frame.table(data), FUN = mean)

Or loop over the unique column names, subset the data and get the rowMeans

sapply(unique(colnames(data)), function(nm)
          rowMeans(data[, colnames(data) == nm]))

Or use dapply from collapse

library(collapse)
dapply(data, MARGIN = 1, FUN = function(x)  fmean(x, g = colnames(data)))
       A  B
 [1,]  3  8
 [2,] 13 18
 [3,] 23 28
 [4,] 33 38
 [5,] 43 48
 [6,] 53 58
 [7,] 63 68
 [8,] 73 78
 [9,] 83 88
[10,] 93 98
  • Related