How to find mean across rows, grouped by first row values?

       S1   S2  S3  S4
Cohort  1    2   1   1
G1     23   44  67  13
G2     11   78  88  30
G3     45   46  56  66
G4     67   77  22  45

This is a demo dataset that I am using where S1, S2... are samples, cohort is the cohort variable which is 1 or 2, and G1, G2... are genes. The values are the expression values.

I want to find mean expression in cohort 1 and cohort 2.

I tried using if statements like if(data$cohort ==1), but that gives me an error: the condition has length > 1. Is there an easy way to work this out?

CodePudding user response：

Data frames are built around columns, not rows. I would first tidy the data into a long column-based format:

library(tidyr)
library(dplyr)
library(tibble)
df = t(data) |> 
  as.data.frame() |> 
  rownames_to_column(var = "sample") |>
  pivot_longer(cols = starts_with("G"), names_to = "gene", values_to = "expression")
df
# # A tibble: 16 × 4
#    sample Cohort gene  expression
#    <chr>   <int> <chr>      <int>
#  1 S1          1 G1            23
#  2 S1          1 G2            11
#  3 S1          1 G3            45
#  4 S1          1 G4            67
#  5 S2          2 G1            44
#  6 S2          2 G2            78
#  7 S2          2 G3            46
#  8 S2          2 G4            77
#  9 S3          1 G1            67
# 10 S3          1 G2            88
# ...

Now that we have a clear grouping column and a value column, we can use any method from the FAQ on calculating mean by group. Here's the dplyr method:

df |>
  group_by(Cohort) |>
  summarize(mean_ex = mean(expression))
# # A tibble: 2 × 2
#   Cohort mean_ex
#    <int>   <dbl>
# 1      1    44.4
# 2      2    61.2
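
For comparison, here is a minimal base R sketch of the same cohort means using aggregate(), one common alternative for a mean-by-group calculation. It rebuilds the long table with stack() instead of tidyr. The names wide, gene_cols, long and cohort_means are illustrative and not part of the original answer.

```r
data <- read.table(text = "       S1   S2  S3  S4
Cohort  1    2   1   1
G1     23   44  67  13
G2     11   78  88  30
G3     45   46  56  66
G4     67   77  22  45", header = TRUE)

# One row per sample, with columns Cohort, G1, ..., G4
wide <- as.data.frame(t(data))
gene_cols <- setdiff(names(wide), "Cohort")

# stack() puts all gene columns into a single 'values' column, gene by gene,
# so the per-sample cohort labels are repeated once per gene
long <- data.frame(
  Cohort = rep(wide$Cohort, times = length(gene_cols)),
  stack(wide[, gene_cols])
)

# Mean expression per cohort, pooled over all genes
cohort_means <- aggregate(values ~ Cohort, data = long, FUN = mean)
cohort_means
```

The result should agree with the dplyr output above; the only difference is that the value column is called values, because that is the name stack() assigns.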

(And you could group_by(Cohort, gene) if you want the mean grouped by both of those... it wasn't clear in your question what your desired output is.)
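
A sketch of that two-variable grouping, assuming the same sample data and the same reshaping steps as above. The name by_cohort_gene is hypothetical, and .groups = "drop" is added only so that an ungrouped tibble is returned.

```r
library(dplyr)
library(tidyr)
library(tibble)

data <- read.table(text = "       S1   S2  S3  S4
Cohort  1    2   1   1
G1     23   44  67  13
G2     11   78  88  30
G3     45   46  56  66
G4     67   77  22  45", header = TRUE)

# Reshape to long format: one row per sample/gene pair
df <- t(data) |>
  as.data.frame() |>
  rownames_to_column(var = "sample") |>
  pivot_longer(cols = starts_with("G"), names_to = "gene", values_to = "expression")

# Mean expression for every cohort/gene combination
by_cohort_gene <- df |>
  group_by(Cohort, gene) |>
  summarize(mean_ex = mean(expression), .groups = "drop")
by_cohort_gene
```

This gives one row per cohort and gene (eight rows for the demo data) rather than one row per cohort.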

Using this sample data:

data = read.table(text = '       S1   S2  S3  S4
Cohort  1    2   1   1
G1     23   44  67  13
G2     11   78  88  30
G3     45   46  56  66
G4     67   77  22  45', header = TRUE)

CodePudding user response：

Transpose your data, then group by Cohort and summarize all gene columns with dplyr::across():

library(dplyr)

data %>%
  t() %>%
  as.data.frame() %>%
  group_by(Cohort) %>%
  summarize(across(G1:G4, mean))

# A tibble: 2 × 5
  Cohort    G1    G2    G3    G4
   <dbl> <dbl> <dbl> <dbl> <dbl>
1      1  34.3    43  55.7  44.7
2      2  44      78  46    77
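
The same per-gene cohort means can also be sketched in base R without transposing, by using the Cohort row as a logical column selector. This is only an illustration under the assumption that the data are laid out as in the question; the names cohort, genes, levels_found and cohort_means are hypothetical. It also relates to the if() error in the question: if() expects a single TRUE or FALSE, whereas a comparison against a vector returns one value per element, so a logical index is the usual tool.

```r
data <- read.table(text = "       S1   S2  S3  S4
Cohort  1    2   1   1
G1     23   44  67  13
G2     11   78  88  30
G3     45   46  56  66
G4     67   77  22  45", header = TRUE)

# Cohort label of each sample, as a plain vector
cohort <- unlist(data["Cohort", ])
# Gene rows only, as a numeric matrix (genes in rows, samples in columns)
genes <- as.matrix(data[rownames(data) != "Cohort", ])

# For each cohort value, average every gene over that cohort's samples;
# drop = FALSE keeps a matrix even when a cohort has a single sample
levels_found <- sort(unique(cohort))
cohort_means <- sapply(levels_found, function(k) {
  rowMeans(genes[, cohort == k, drop = FALSE])
})
colnames(cohort_means) <- paste0("cohort_", levels_found)
cohort_means
```

Here cohort == k is a vector with one TRUE/FALSE per sample, which is valid as a column index but not as an if() condition.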

CodePudding user response：

This is another possibility:

  
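library(dplyr)
library(tidyr)
library(purrr)

# df is the transposed data frame built in the "Data" section below
df %>% pivot_longer(-Cohort) %>% 
  nest(data = -Cohort) %>% 
  mutate(mean = map(data, ~ mean(.x$value))) %>% 
  unnest(mean)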
#> # A tibble: 2 × 3
#>   Cohort data               mean
#>    <int> <list>            <dbl>
#> 1      1 <tibble [12 × 2]>  44.4
#> 2      2 <tibble [4 × 2]>   61.2

Data:

df <- read.table(text = "
       S1   S2  S3  S4
Cohort  1    2   1   1
G1     23   44  67  13
G2     11   78  88  30
G3     45   46  56  66
G4     67   77  22  45", header = TRUE) %>% 
  t() %>% 
  as.data.frame()