How to make a new column after subtracting specific values in a data frame?

I have gene expression data with about 2100 rows in a data frame, and I have been struggling to calculate a fold change for each gene by subtracting the control sample's expression from each sample's expression. For example, suppose I have a data frame as follows:

sample <- c(rep("A21_Control",3),rep("A21_1",3),rep("A21_2",3),rep("A21_3",3))
gene <- c(rep(c("geneA","geneB","geneC"),4))
Expression <- c(5.4, 8.5, 11.0, 3.9, 0.9, 11.8, 7.0, -3.1, 8.3, 1.3, 6.5, -2.5)

df <- data.frame(sample, gene, Expression)

which gives:

> df
        sample  gene Expression
1  A21_Control geneA        5.4
2  A21_Control geneB        8.5
3  A21_Control geneC       11.0
4        A21_1 geneA        3.9
5        A21_1 geneB        0.9
6        A21_1 geneC       11.8
7        A21_2 geneA        7.0
8        A21_2 geneB       -3.1
9        A21_2 geneC        8.3
10       A21_3 geneA        1.3
11       A21_3 geneB        6.5
12       A21_3 geneC       -2.5

In this df, I would like to subtract the 'Expression' of the control group 'A21_Control' from the 'Expression' of each of the sample groups 'A21_1', 'A21_2', and 'A21_3', matching rows by gene name. For instance, for 'sample' 'A21_1' and 'gene' 'geneA', I would like to take its 'Expression' value and subtract the 'Expression' value where 'sample' is 'A21_Control' and 'gene' is 'geneA' (3.9 - 5.4 = -1.5).

I would then like to put those values into a new column and drop the control samples, which would give an output data frame as follows:

df2 <- data.frame(sample=c(rep("A21_1",3),rep("A21_2",3),rep("A21_3",3)),
                  gene=c(rep(c("geneA","geneB","geneC"),3)),
                  Expression=c(3.9, 0.9, 11.8, 7.0, -3.1, 8.3, 1.3, 6.5, -2.5),
                  FoldChange=c(-1.5, -7.6, 0.8, 1.6, -11.6, -2.7, -4.1, -2, -13.5))

> df2
  sample  gene Expression FoldChange
1  A21_1 geneA        3.9       -1.5
2  A21_1 geneB        0.9       -7.6
3  A21_1 geneC       11.8        0.8
4  A21_2 geneA        7.0        1.6
5  A21_2 geneB       -3.1      -11.6
6  A21_2 geneC        8.3       -2.7
7  A21_3 geneA        1.3       -4.1
8  A21_3 geneB        6.5       -2.0
9  A21_3 geneC       -2.5      -13.5

The problem is that I have a data frame with about 2100 rows, so if anyone can help to construct a loop or any other function to perform this job, I would really appreciate it. Thank you very much!

Best wishes, TJ

CodePudding user response：

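Please check the code below. It pulls the control rows into their own data frame, joins them back onto the non-control rows by gene, and subtracts the control value: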

library(dplyr)

control <- df %>% filter(sample=='A21_Control')

df %>% filter(sample!='A21_Control') %>%
  inner_join(control %>% select(gene, control=Expression), by='gene') %>%
  mutate(FoldChange=Expression-control)

Created on 2023-01-30 with reprex v2.0.2

  sample  gene Expression control FoldChange
1  A21_1 geneA        3.9     5.4       -1.5
2  A21_1 geneB        0.9     8.5       -7.6
3  A21_1 geneC       11.8    11.0        0.8
4  A21_2 geneA        7.0     5.4        1.6
5  A21_2 geneB       -3.1     8.5      -11.6
6  A21_2 geneC        8.3    11.0       -2.7
7  A21_3 geneA        1.3     5.4       -4.1
8  A21_3 geneB        6.5     8.5       -2.0
9  A21_3 geneC       -2.5    11.0      -13.5
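
For readers who prefer not to load dplyr, the same join-then-subtract idea can be sketched in base R with match(). This is only an illustrative alternative, not part of the answer above; the names is_control, control, treated, and idx are made up for the example, and it assumes each gene appears exactly once among the control rows.

```r
# Rebuild the example data from the question.
df <- data.frame(
  sample = rep(c("A21_Control", "A21_1", "A21_2", "A21_3"), each = 3),
  gene = rep(c("geneA", "geneB", "geneC"), times = 4),
  Expression = c(5.4, 8.5, 11.0, 3.9, 0.9, 11.8, 7.0, -3.1, 8.3, 1.3, 6.5, -2.5)
)

# Split the rows into control and non-control samples.
is_control <- df$sample == "A21_Control"
control <- df[is_control, ]
treated <- df[!is_control, ]

# For each non-control row, find the position of its gene among the control rows
# (assumes one control row per gene; match() returns only the first hit).
idx <- match(treated$gene, control$gene)

# Subtract the matching control value from each sample value.
treated$FoldChange <- treated$Expression - control$Expression[idx]

# Reset the row names so they run 1..n like the desired df2.
rownames(treated) <- NULL
treated
```

A gene with no control row would get NA in FoldChange here, because match() returns NA when it finds no match.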

CodePudding user response：

You can first group_by the gene column, then subtract the control's Expression from every Expression in the group, picking out the control value by indexing the Expression column with grepl. Finally, filter out the samples that contain the string "Control".

library(dplyr)

df %>% group_by(gene) %>% 
  # if Control samples are sorted to be the first entries, 
  # you can use mutate(FoldChange = Expression - first(Expression)) %>% 
  mutate(FoldChange = Expression - Expression[grepl("Control", sample)]) %>%
  filter(!grepl("Control", sample))

# A tibble: 9 × 4
# Groups:   gene [3]
  sample gene  Expression FoldChange
  <chr>  <chr>      <dbl>      <dbl>
1 A21_1  geneA        3.9     -1.5  
2 A21_1  geneB        0.9     -7.6  
3 A21_1  geneC       11.8      0.800
4 A21_2  geneA        7        1.6  
5 A21_2  geneB       -3.1    -11.6  
6 A21_2  geneC        8.3     -2.7  
7 A21_3  geneA        1.3     -4.1  
8 A21_3  geneB        6.5     -2    
9 A21_3  geneC       -2.5    -13.5
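
The grepl indexing above relies on every gene having exactly one control row. With about 2100 rows that may be worth confirming first: if a gene has no control row, or several, the subtraction is likely to either raise an error or recycle values in a way that is easy to miss. A minimal base R sketch of such a check follows; the names is_control, n_control, and bad_genes are invented for the example.

```r
# Rebuild the example data from the question.
df <- data.frame(
  sample = rep(c("A21_Control", "A21_1", "A21_2", "A21_3"), each = 3),
  gene = rep(c("geneA", "geneB", "geneC"), times = 4),
  Expression = c(5.4, 8.5, 11.0, 3.9, 0.9, 11.8, 7.0, -3.1, 8.3, 1.3, 6.5, -2.5)
)

# Flag control rows the same way the answer does.
is_control <- grepl("Control", df$sample)

# Count control rows per gene; factor() with explicit levels keeps genes
# that have zero control rows in the table instead of dropping them.
n_control <- table(factor(df$gene[is_control], levels = unique(df$gene)))
n_control

# Collect any gene that does not have exactly one control row.
bad_genes <- names(n_control)[n_control != 1]
if (length(bad_genes) > 0) {
  warning("Genes without exactly one control row: ",
          paste(bad_genes, collapse = ", "))
}
```

Separately, the tibble printed above is still grouped by gene (see the "Groups:" line); adding ungroup() at the end of the pipe removes the grouping if later steps should not operate per gene.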