I have a data frame like this:
df = data.frame("subjectID" = c("S1","S2","S2","S1","S1","S2","S2","S1","S1","S2","S1","S2"), "treatment" = c("none","none","none","none","drug1","drug1","drug1","drug1","drug2","drug2","drug2","drug2"), "protein" = c("proteinA","proteinA","proteinB","proteinB","proteinA","proteinA","proteinB","proteinB","proteinA","proteinA","proteinB","proteinB"), "value"= c(5.3,4.3,4.5,2.3,6.5,5.4,1.2,3.2,2.3,4.5,6.5,3.4))
   subjectID treatment  protein value
1         S1      none proteinA   5.3
2         S2      none proteinA   4.3
3         S2      none proteinB   4.5
4         S1      none proteinB   2.3
5         S1     drug1 proteinA   6.5
6         S2     drug1 proteinA   5.4
7         S2     drug1 proteinB   1.2
8         S1     drug1 proteinB   3.2
9         S1     drug2 proteinA   2.3
10        S2     drug2 proteinA   4.5
11        S1     drug2 proteinB   6.5
12        S2     drug2 proteinB   3.4
I have to do the following calculations on this dataframe:
- Find the difference in value between treatment = "drug1" and treatment = "none" for each protein for each subject.
So basically for a single calculation it would be :
diff = df$value[df$subjectID == "S1" & df$treatment == "drug1" & df$protein == "proteinA"] - df$value[df$subjectID == "S1" & df$treatment == "none" & df$protein == "proteinA"]
diff
> 1.2
In the above example, the values 6.5 - 5.3 give the difference between the drug treated and no treatment sample for proteinA. I similarly repeat this for S2 and proteinA, S1/proteinB and S2/proteinB.
- Find the mean difference across subjects.
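The two steps above can be sketched in base R for a single protein and drug (proteinA, drug1). This is only an illustration on the sample data; `value_for` is a hypothetical helper introduced here, not part of any package.

```r
df <- data.frame(
  subjectID = c("S1","S2","S2","S1","S1","S2","S2","S1","S1","S2","S1","S2"),
  treatment = c("none","none","none","none","drug1","drug1","drug1","drug1","drug2","drug2","drug2","drug2"),
  protein   = c("proteinA","proteinA","proteinB","proteinB","proteinA","proteinA","proteinB","proteinB","proteinA","proteinA","proteinB","proteinB"),
  value     = c(5.3,4.3,4.5,2.3,6.5,5.4,1.2,3.2,2.3,4.5,6.5,3.4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the single value for one subject/treatment/protein
value_for <- function(subject, treatment, protein) {
  df$value[df$subjectID == subject & df$treatment == treatment & df$protein == protein]
}

# Step 1: drug1 minus none, per subject, for proteinA
subjects <- unique(df$subjectID)
per_subject_diff <- sapply(subjects, function(s) {
  value_for(s, "drug1", "proteinA") - value_for(s, "none", "proteinA")
})

# Step 2: average the per-subject differences (S1: 6.5 - 5.3, S2: 5.4 - 4.3)
mean_diff <- mean(per_subject_diff)
```

For the sample data `mean_diff` comes out at about 1.15, which is the proteinA/drug1 cell of the desired output.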
My original data has 5 subjects, 10 treatments (including treatment == "none") and 100 proteins, so I can't do this manually for each grouping. I need the mean difference between each drug treatment and no treatment (9 drug treatments vs. "none").
The desired output could be like so:
resdf
   protein drug1_mean_diff drug2_mean_diff
1 proteinA            1.15            -1.4
2 proteinB            -1.2            1.55
For the full data I should end up with 100 proteins (rows) and 9 mean differences (columns).
Hope this is clear.
Thank you !
CodePudding user response:
library(tidyverse)
df <- data.frame(
  "subjectID" = c("S1", "S2", "S2", "S1", "S1", "S2", "S2", "S1", "S1", "S2", "S1", "S2"),
  "treatment" = c("none", "none", "none", "none", "drug1", "drug1", "drug1", "drug1", "drug2", "drug2", "drug2", "drug2"),
  "protein" = c("proteinA", "proteinA", "proteinB", "proteinB", "proteinA", "proteinA", "proteinB", "proteinB", "proteinA", "proteinA", "proteinB", "proteinB"),
  "value" = c(5.3, 4.3, 4.5, 2.3, 6.5, 5.4, 1.2, 3.2, 2.3, 4.5, 6.5, 3.4)
)
# For every pair of protein and drug treatment
expand_grid(
  protein = df$protein %>% unique(),
  comparison = df$treatment %>% unique() %>% setdiff("none")
) %>%
  mutate(
    mean_diff = comparison %>% map2_dbl(protein, ~ {
      df %>%
        # one row per subject and protein, one column per treatment
        pivot_wider(names_from = treatment, values_from = value) %>%
        filter(protein == .y) %>%
        # rename the current drug column (.x) so it can be referred to as `drug`
        rename_at(.x, ~"drug") %>%
        # drug minus control, as asked in the question
        mutate(diff = drug - none) %>%
        pull(diff) %>%
        mean()
    })
  ) %>%
  pivot_wider(names_from = comparison, values_from = mean_diff, names_prefix = "mean_diff_")
#> # A tibble: 2 x 3
#>   protein  mean_diff_drug1 mean_diff_drug2
#>   <chr>              <dbl>           <dbl>
#> 1 proteinA            1.15           -1.4
#> 2 proteinB           -1.2             1.55
Created on 2021-10-05 by the reprex package (v2.0.1)
CodePudding user response:
Somehow I can't reproduce the expected output as shown in the question. However, I think this code should give the desired answer. But I might be mistaken or have misunderstood something. So please check before using the code:
library(tidyverse)

df = data.frame(subjectID = c("S1","S2","S2","S1","S1","S2","S2","S1","S1","S2","S1","S2"),
                treatment = c("none","none","none","none","drug1","drug1","drug1","drug1","drug2","drug2","drug2","drug2"),
                protein = c("proteinA","proteinA","proteinB","proteinB","proteinA","proteinA","proteinB","proteinB","proteinA","proteinA","proteinB","proteinB"),
                value = c(5.3,4.3,4.5,2.3,6.5,5.4,1.2,3.2,2.3,4.5,6.5,3.4))
df %>%
  filter(treatment != "none") %>%
  # attach each subject's untreated ("none") value for the same protein
  left_join(
    df %>% filter(treatment == "none") %>% select(subjectID, protein, control = value),
    by = c("subjectID", "protein")
  ) %>%
  mutate(diff = value - control) %>%
  select(subjectID, protein, treatment, diff) %>%
  pivot_wider(names_from = treatment, values_from = diff, names_prefix = "diff_") %>%
  group_by(protein) %>%
  # note: the argument is na.rm, not rm.na
  summarise(across(starts_with("diff"), ~ mean(.x, na.rm = TRUE)))
Returns:
  protein  diff_drug1 diff_drug2
  <chr>         <dbl>      <dbl>
1 proteinA       1.15      -1.4
2 proteinB      -1.20       1.55
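As a cross-check that does not depend on the tidyverse, the same join-then-average idea can be sketched in base R with merge() and tapply(). This is a swapped-in technique, not the code from the answer above, and the names `control`, `treated`, `merged` and `resdf` are only illustrative.

```r
df <- data.frame(
  subjectID = c("S1","S2","S2","S1","S1","S2","S2","S1","S1","S2","S1","S2"),
  treatment = c("none","none","none","none","drug1","drug1","drug1","drug1","drug2","drug2","drug2","drug2"),
  protein   = c("proteinA","proteinA","proteinB","proteinB","proteinA","proteinA","proteinB","proteinB","proteinA","proteinA","proteinB","proteinB"),
  value     = c(5.3,4.3,4.5,2.3,6.5,5.4,1.2,3.2,2.3,4.5,6.5,3.4),
  stringsAsFactors = FALSE
)

# Split the untreated rows from the drug rows
control <- df[df$treatment == "none", c("subjectID", "protein", "value")]
names(control)[names(control) == "value"] <- "control"
treated <- df[df$treatment != "none", ]

# Attach each subject's own control value to every drug measurement
merged <- merge(treated, control, by = c("subjectID", "protein"))
merged$diff <- merged$value - merged$control

# Average over subjects: rows are proteins, columns are drug treatments
resdf <- tapply(merged$diff, list(merged$protein, merged$treatment), mean)
```

`resdf` is a matrix rather than a data frame, with one row per protein and one column per drug, so with the full data it should have 100 rows and 9 columns. If some subject/protein combinations are missing, passing `na.rm = TRUE` as an extra argument to tapply() forwards it to mean().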