I have a data frame with three a set of genes on the row and three group of columns, I want to calculate the mean of each group per each row and calculate the difference between the means for each group with respect the one of the groups that count as my baseline. Here a sample of my data format.
m <- matrix(runif(72, min = -10, max = 15), nrow = 8, ncol = 9)
df <- as.data.frame(m, row.names = paste0('g', rep(1:8)))
colnames(df) <- c(paste0("A_", rep(1:3)), paste0("B_", rep(1:3)), paste0("C_", rep(1:3)))
I want first calculate the mean of each group of A, B and C, for each row, then find the difference of mean B- mean A, and mean C - mean A. The results table would be something like :
res_tab <- data.frame(B = runif(8), C = runif(8), row.names = rownames(df))
I have a big data frame with many groups, that's what I m looking to a nice way to do computation using probably dplyr.
Thank you!
CodePudding user response:
Here are a couple of dplyr options:
- Using
rowwise()
andc_across()
df %>%
rowwise() %>%
mutate(a_mean = mean(c_across(starts_with("A")))) %>%
transmute(
B = mean(c_across(starts_with("B"))) - a_mean,
C = mean(c_across(starts_with("C"))) - a_mean
)
- pivoting longer, doing the calculation and pivoting back to wide
pivot_longer(mutate(df, id=row_number()), -id) %>%
group_by(id,grp = substr(name,1,1)) %>%
summarize(mval = mean(value), .groups="drop") %>%
pivot_wider(id, names_from=grp, values_from=mval) %>%
transmute(B = B-A, C=C-A)
A better pivot-based approach as suggested by @onyambu:
pivot_longer(mutate(df, id=row_number()), -id, names_to = c('.value','grp'), names_sep='_') %>%
group_by(id) %>%
summarize(across(-c(A,grp), ~mean(.x)-mean(A)))
Output:
B C
<dbl> <dbl>
1 3.37 1.03
2 5.26 6.44
3 -5.93 0.477
4 -7.08 -5.52
5 -10.9 -2.08
6 -6.74 -5.45
7 2.02 3.57
8 -9.89 -14.0
Input:
set.seed(123)
m <- matrix(runif(72, min = -10, max = 15), nrow = 8, ncol = 9)
df <- as.data.frame(m, row.names = paste0('g', rep(1:8)))
colnames(df) <- c(paste0("A_", rep(1:3)), paste0("B_", rep(1:3)), paste0("C_", rep(1:3)))
Note, if have a large dataset, and you want something a bit faster approach, this data.table
approach is faster
dcast(
melt(setDT(df)[,id:=.I],id.vars = "id") %>%
.[,mean(value),.(id,grp=substr(variable,1,1))] %>%
.[, .(unique(grp)[-1], V1[-1]-V1[1]), id],
id~V1,value.var="V2"
)
On the other hand, this base R approach is probably the fastest, but needs to be adjusted for the number of groups (here 3:, A, B and C), and the number of measures per group (here 3):
base=rowMeans(df[,1:3])
sapply(seq(1,ncol(df),3)[-1], \(x) rowMeans(df[,x:(x 2)])-base)
CodePudding user response:
in Base R:
A <- sapply(split.default(df, sub('_.*', '', names(df))), rowMeans)
A[,-1] - A[,1]
B C
g1 3.367590 1.0253672
g2 5.263040 6.4389376
g3 -5.927725 0.4767491
g4 -7.083931 -5.5150237
g5 -10.895943 -2.0766240
g6 -6.738881 -5.4535142
g7 2.022520 3.5654650
g8 -9.888554 -13.9698135
or even
tapply(unlist(df), list(row(df), sub('_.*', '', names(df))[col(df)]), mean)