I have a matrix with column names showing the grouping information (group A and B in the example below). What I want is for each row, calculate a given summary statistics like mean using columns that belong to each group. An example code can be found below:
#input data
data = matrix(1:100, 10, 10, byrow = T)
colnames(data) = rep(c("A", "B"), each = 5)
data
#calculation
result = t(apply(data, 1, function(x, label){aggregate(x, by=list(label), FUN=mean)$x}, label = colnames(data)))
result
There are different ways of doing this (like a for loop or using apply like the implementation above). However, this implementation is not very efficient when I use it on matrices with hundreds of rows and at least 10k columns. I was wondering if there is any faster and more efficient way of implementing this? My input data format is matrix so any method that is implemented based on other data types need to also consider the time of data type converting.
CodePudding user response:
This is the best scenario to use tapply
:
tapply(t(data), list(col(data), array(colnames(data), dim(t(data)))), mean)
A B
1 3 8
2 13 18
3 23 28
4 33 38
5 43 48
6 53 58
7 63 68
8 73 78
9 83 88
10 93 98
tapply(data, list(t(colnames(data))[rep(1,nrow(data)), ], row(data)), mean)
1 2 3 4 5 6 7 8 9 10
A 3 13 23 33 43 53 63 73 83 93
B 8 18 28 38 48 58 68 78 88 98
tapply(t(data), interaction(colnames(data), col(data)), mean)
A.1 B.1 A.2 B.2 A.3 B.3 A.4 B.4 A.5 B.5 A.6 B.6 A.7 B.7 A.8 B.8 A.9 B.9 A.10 B.10
3 8 13 18 23 28 33 38 43 48 53 58 63 68 73 78 83 88 93 98
More base R solutions:
sapply(split.default(data.frame(data), colnames(data)), rowMeans)
A B
[1,] 3 8
[2,] 13 18
[3,] 23 28
[4,] 33 38
[5,] 43 48
[6,] 53 58
[7,] 63 68
[8,] 73 78
[9,] 83 88
[10,] 93 98
data.frame(data) |>
reshape(split(1:ncol(data), colnames(data)), dir = 'long') |>
(\(x)aggregate(.~id, x, mean))()
id time A B
1 1 3 3 8
2 2 3 13 18
3 3 3 23 28
4 4 3 33 38
5 5 3 43 48
6 6 3 53 58
7 7 3 63 68
8 8 3 73 78
9 9 3 83 88
10 10 3 93 98
CodePudding user response:
We may use aggregate
in base R
aggregate(Freq ~ ., as.data.frame.table(data), FUN = mean)
Or loop over the unique
column names, subset the data and get the rowMeans
sapply(unique(colnames(data)), function(nm)
rowMeans(data[, colnames(data) == nm]))
Or use dapply
from collapse
library(collapse)
dapply(data, MARGIN = 1, FUN = function(x) fmean(x, g = colnames(data)))
A B
[1,] 3 8
[2,] 13 18
[3,] 23 28
[4,] 33 38
[5,] 43 48
[6,] 53 58
[7,] 63 68
[8,] 73 78
[9,] 83 88
[10,] 93 98