Calculate the mean per group and find the log difference with respect to baseline in R-CodePudding

I have a data frame with three a set of genes on the row and three group of columns, I want to calculate the mean of each group per each row and calculate the difference between the means for each group with respect the one of the groups that count as my baseline. Here a sample of my data format.

m <- matrix(runif(72, min = -10, max = 15), nrow = 8, ncol = 9)
df <- as.data.frame(m, row.names = paste0('g', rep(1:8))) 
colnames(df) <- c(paste0("A_", rep(1:3)), paste0("B_", rep(1:3)), paste0("C_", rep(1:3)))

I want first calculate the mean of each group of A, B and C, for each row, then find the difference of mean B- mean A, and mean C - mean A. The results table would be something like :

res_tab <- data.frame(B = runif(8), C = runif(8), row.names = rownames(df))

I have a big data frame with many groups, that's what I m looking to a nice way to do computation using probably dplyr.

Thank you!

CodePudding user response：

Here are a couple of dplyr options:

Using rowwise() and c_across()

df %>% 
  rowwise() %>% 
  mutate(a_mean = mean(c_across(starts_with("A")))) %>% 
  transmute(
    B = mean(c_across(starts_with("B"))) - a_mean,
    C = mean(c_across(starts_with("C"))) - a_mean
  )

pivoting longer, doing the calculation and pivoting back to wide

pivot_longer(mutate(df, id=row_number()), -id) %>% 
  group_by(id,grp = substr(name,1,1)) %>% 
  summarize(mval = mean(value), .groups="drop") %>% 
  pivot_wider(id, names_from=grp, values_from=mval) %>% 
  transmute(B = B-A, C=C-A)

A better pivot-based approach as suggested by @onyambu:

pivot_longer(mutate(df, id=row_number()), -id, names_to = c('.value','grp'), names_sep='_') %>%
  group_by(id) %>%
  summarize(across(-c(A,grp), ~mean(.x)-mean(A)))

Output:

       B       C
   <dbl>   <dbl>
1   3.37   1.03 
2   5.26   6.44 
3  -5.93   0.477
4  -7.08  -5.52 
5 -10.9   -2.08 
6  -6.74  -5.45 
7   2.02   3.57 
8  -9.89 -14.0

Input:

set.seed(123)
m <- matrix(runif(72, min = -10, max = 15), nrow = 8, ncol = 9)
df <- as.data.frame(m, row.names = paste0('g', rep(1:8))) 
colnames(df) <- c(paste0("A_", rep(1:3)), paste0("B_", rep(1:3)), paste0("C_", rep(1:3)))

Note, if have a large dataset, and you want something a bit faster approach, this data.table approach is faster

dcast(
  melt(setDT(df)[,id:=.I],id.vars = "id") %>% 
  .[,mean(value),.(id,grp=substr(variable,1,1))] %>% 
  .[, .(unique(grp)[-1], V1[-1]-V1[1]), id], 
  id~V1,value.var="V2"
)

On the other hand, this base R approach is probably the fastest, but needs to be adjusted for the number of groups (here 3:, A, B and C), and the number of measures per group (here 3):

base=rowMeans(df[,1:3])
sapply(seq(1,ncol(df),3)[-1], \(x) rowMeans(df[,x:(x 2)])-base)

CodePudding user response：

in Base R:

A <- sapply(split.default(df, sub('_.*', '', names(df))), rowMeans)
A[,-1] - A[,1]
            B           C
g1   3.367590   1.0253672
g2   5.263040   6.4389376
g3  -5.927725   0.4767491
g4  -7.083931  -5.5150237
g5 -10.895943  -2.0766240
g6  -6.738881  -5.4535142
g7   2.022520   3.5654650
g8  -9.888554 -13.9698135

or even

tapply(unlist(df), list(row(df), sub('_.*', '', names(df))[col(df)]), mean)