I have gene expression data with about 2100 rows in a data frame, and I have been struggling to calculate a fold change by subtracting the expression of a control sample from the expression of each other sample. For example, if I have a data frame as follows:
sample <- c(rep("A21_Control",3),rep("A21_1",3),rep("A21_2",3),rep("A21_3",3))
gene <- c(rep(c("geneA","geneB","geneC"),4))
Expression <- c(5.4, 8.5, 11.0, 3.9, 0.9, 11.8, 7.0, -3.1, 8.3, 1.3, 6.5, -2.5)
df <- data.frame(sample, gene, Expression)
which gives df as:
> df
sample gene Expression
1 A21_Control geneA 5.4
2 A21_Control geneB 8.5
3 A21_Control geneC 11.0
4 A21_1 geneA 3.9
5 A21_1 geneB 0.9
6 A21_1 geneC 11.8
7 A21_2 geneA 7.0
8 A21_2 geneB -3.1
9 A21_2 geneC 8.3
10 A21_3 geneA 1.3
11 A21_3 geneB 6.5
12 A21_3 geneC -2.5
In this df, I would like to subtract the 'Expression' of the control group named 'A21_Control' from the 'Expression' of each of the sample groups named 'A21_1', 'A21_2' and 'A21_3', within the same gene name. For instance, I would like to take the 'Expression' value where 'sample' is 'A21_1' and 'gene' is 'geneA' and subtract from it the 'Expression' value where 'sample' is 'A21_Control' and 'gene' is 'geneA' (3.9 - 5.4 = -1.5).
I would then like to put those values into a new column and drop the control samples, which would give an output data frame as follows:
df2 <- data.frame(sample=c(rep("A21_1",3),rep("A21_2",3),rep("A21_3",3)),
gene=c(rep(c("geneA","geneB","geneC"),3)),
Expression=c(3.9, 0.9, 11.8, 7.0, -3.1, 8.3, 1.3, 6.5, -2.5),
FoldChange=c(-1.5, -7.6, 0.8, 1.6, -11.6, -2.7, -4.1, -2, -13.5))
> df2
sample gene Expression FoldChange
1 A21_1 geneA 3.9 -1.5
2 A21_1 geneB 0.9 -7.6
3 A21_1 geneC 11.8 0.8
4 A21_2 geneA 7.0 1.6
5 A21_2 geneB -3.1 -11.6
6 A21_2 geneC 8.3 -2.7
7 A21_3 geneA 1.3 -4.1
8 A21_3 geneB 6.5 -2.0
9 A21_3 geneC -2.5 -13.5
The problem is that I have a data frame with about 2100 rows, so if anyone can help to construct a loop or any other function to perform this job, I would really appreciate it. Thank you very much!
Best wishes, TJ
CodePudding user response:
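The code below does this by joining each sample row to the control row for the same gene:
library(dplyr)
# Keep the control rows aside as a per-gene lookup table
control <- df %>% filter(sample == 'A21_Control')
# Attach each gene's control value to the non-control rows, then subtract
df %>% filter(sample != 'A21_Control') %>%
  inner_join(control %>% select(gene, control = Expression), by = 'gene') %>%
  mutate(FoldChange = Expression - control)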
Created on 2023-01-30 with reprex v2.0.2
sample gene Expression control FoldChange
1 A21_1 geneA 3.9 5.4 -1.5
2 A21_1 geneB 0.9 8.5 -7.6
3 A21_1 geneC 11.8 11.0 0.8
4 A21_2 geneA 7.0 5.4 1.6
5 A21_2 geneB -3.1 8.5 -11.6
6 A21_2 geneC 8.3 11.0 -2.7
7 A21_3 geneA 1.3 5.4 -4.1
8 A21_3 geneB 6.5 8.5 -2.0
9 A21_3 geneC -2.5 11.0 -13.5
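If dplyr is not available, the same join-and-subtract idea can be sketched in base R. This is only a rough equivalent, assuming there is exactly one 'A21_Control' row per gene; the names is_control and result are made up for this illustration and do not appear in the question.

```r
# Rebuild the example data from the question.
sample <- c(rep("A21_Control", 3), rep("A21_1", 3), rep("A21_2", 3), rep("A21_3", 3))
gene <- rep(c("geneA", "geneB", "geneC"), 4)
Expression <- c(5.4, 8.5, 11.0, 3.9, 0.9, 11.8, 7.0, -3.1, 8.3, 1.3, 6.5, -2.5)
df <- data.frame(sample, gene, Expression)

# Split the control rows from the rest (hypothetical helper names).
is_control <- df$sample == "A21_Control"
control <- df[is_control, ]
result <- df[!is_control, ]

# match() gives, for each row of result, the position of the control row
# with the same gene (assumes one control row per gene).
idx <- match(result$gene, control$gene)
result$FoldChange <- result$Expression - control$Expression[idx]

rownames(result) <- NULL
print(result)
```

Unlike the join above, this does not keep a separate control column, so the result has the same four columns as the desired df2.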
CodePudding user response:
You can first group_by the gene column, then subtract the control's Expression from every Expression in the group, picking out the control value by indexing the Expression column with grepl. Finally, filter out the samples whose name contains the string "Control".
library(dplyr)
df %>% group_by(gene) %>%
# if Control samples are sorted to be the first entries,
# you can use mutate(FoldChange = Expression - first(Expression)) %>%
mutate(FoldChange = Expression - Expression[grepl("Control", sample)]) %>%
filter(!grepl("Control", sample))
# A tibble: 9 × 4
# Groups: gene [3]
sample gene Expression FoldChange
<chr> <chr> <dbl> <dbl>
1 A21_1 geneA 3.9 -1.5
2 A21_1 geneB 0.9 -7.6
3 A21_1 geneC 11.8 0.800
4 A21_2 geneA 7 1.6
5 A21_2 geneB -3.1 -11.6
6 A21_2 geneC 8.3 -2.7
7 A21_3 geneA 1.3 -4.1
8 A21_3 geneB 6.5 -2
9 A21_3 geneC -2.5 -13.5
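Indexing with grepl inside each gene group relies on every gene having exactly one control row. On the full data of about 2100 rows it may be worth checking that first. A minimal sketch of such a check in base R follows; the names n_control and bad are assumptions made for this example.

```r
# Rebuild the example data from the question.
sample <- c(rep("A21_Control", 3), rep("A21_1", 3), rep("A21_2", 3), rep("A21_3", 3))
gene <- rep(c("geneA", "geneB", "geneC"), 4)
Expression <- c(5.4, 8.5, 11.0, 3.9, 0.9, 11.8, 7.0, -3.1, 8.3, 1.3, 6.5, -2.5)
df <- data.frame(sample, gene, Expression)

# Count control rows per gene; factor() with explicit levels makes sure
# genes with no control at all show up with a count of 0.
control_genes <- df$gene[grepl("Control", df$sample)]
n_control <- table(factor(control_genes, levels = unique(df$gene)))

# Genes with zero or several control rows would break the subtraction.
bad <- names(n_control)[n_control != 1]
if (length(bad) > 0) {
  stop("Genes without exactly one control row: ", paste(bad, collapse = ", "))
}
```

With the example data every gene has one control row, so the check passes silently.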