I am trying to use a t.test()
to compare means for multiple columns in a dataframe where the values to compare are within each column. I have several columns of metadata (Date
,Assay
,Timing
) for each row. My data looks like df
below where the data collected is not paired and meas1
and meas2
are different measured outcomes that are not related. The comparison I am trying to make is between meas1[Timing=="Start"]
and meas1[Timing == "End"]
on each date, for each assay, and each test. My actual data has ~10 columns of measurement data which affects my syntax for some of the subsetting.
library(tidyverse)
df <- data.frame(Date=rep(c("2022-01-01","2022-01-02"), each = 18),
Assay = rep(c("Gly", "Asp", "Con"), each = 3, times = 4),
Timing = c(rep("Start",9),rep("End",9)),
meas1=round(rnorm(36,5,3),0),
meas2=round(rnorm(36,8,9),0))
I have tried a couple of different approaches. One is to attempt to use inner_join()
and pivot_longer()
to join the data together using a separate dataframe of the metadata, but I am not getting the desired outcome.
comp <- list(Assay = c("Gly","Asp","Con"),
first = "Start",
last = "End",
test = names(df %>% select(-Date,-Assay,-Timing))) %>%
cross_df()
df_pivot <- df %>%
pivot_longer(c(-Date,-Assay,-Timing), names_to = "test")
t_tests <- comp %>%
inner_join(df_pivot, by = c("Assay", "test", "first"="Timing")) %>%
rename(initial = value) %>%
inner_join(df_pivot, by = c("Date", "Assay", "test", "last"="Timing")) %>%
rename(final = value)
t_tests
# A tibble: 108 × 7
Assay first last test Date initial final
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 Gly Start End meas1 2022-01-01 8 8
2 Gly Start End meas1 2022-01-01 8 9
3 Gly Start End meas1 2022-01-01 8 4
4 Gly Start End meas1 2022-01-01 4 8
5 Gly Start End meas1 2022-01-01 4 9
6 Gly Start End meas1 2022-01-01 4 4
7 Gly Start End meas1 2022-01-01 -1 8
8 Gly Start End meas1 2022-01-01 -1 9
9 Gly Start End meas1 2022-01-01 -1 4
10 Gly Start End meas1 2022-01-02 6 1
# … with 98 more rows
# ℹ Use `print(n = ...)` to see more rows
The initial values are repeated for each distinct final value, which is not what I want since the data aren't paired. I am trying to get only 36 rows: 2 dates, 3 assays, 2 tests, 6 values for each test (3 values by 2 columns). In other words, rows 1:9 should be condensed to 3 rows (row 1, 5, and 9) containing only the unique initial and final values. This is where I need help. The 1,5,9 pattern should repeat but I was hoping to avoid slicing the data after the fact.
Assuming that part was done correctly, I'd proceed as follows, which gives me the summary tibble of t.test()
outcomes that I want:
t_tests <- t_tests %>%
mutate(first = NULL, last = NULL) %>%
group_by(Date,Assay,test) %>%
group_modify(~broom::tidy(t.test(.x$initial,.x$final))) %>% ungroup()
t_tests
# A tibble: 12 × 13
Date Assay test estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 2022-01-01 Asp values1 -2.33 5.67 8 -1.79 0.0989 11.7 -5.18 0.511 Welch Two Sample t-test two.sided
2 2022-01-01 Asp values2 9.33 6.67 -2.67 2.14 0.0643 8.16 -0.698 19.4 Welch Two Sample t-test two.sided
3 2022-01-01 Con values1 3.67 6.67 3 2.17 0.0552 10.1 -0.0984 7.43 Welch Two Sample t-test two.sided
4 2022-01-01 Con values2 -8.33 2.67 11 -2.93 0.0110 14.1 -14.4 -2.23 Welch Two Sample t-test two.sided
5 2022-01-01 Gly values1 0.333 5 4.67 0.343 0.737 13.1 -1.76 2.43 Welch Two Sample t-test two.sided
6 2022-01-01 Gly values2 -0.333 5.67 6 -0.100 0.922 11.5 -7.63 6.96 Welch Two Sample t-test two.sided
7 2022-01-02 Asp values1 2 6 4 1.36 0.193 16 -1.12 5.12 Welch Two Sample t-test two.sided
8 2022-01-02 Asp values2 11 11.7 0.667 2.02 0.0731 9.27 -1.26 23.3 Welch Two Sample t-test two.sided
9 2022-01-02 Con values1 -2 4.33 6.33 -1.75 0.0999 15.4 -4.43 0.429 Welch Two Sample t-test two.sided
10 2022-01-02 Con values2 11 11.3 0.333 5.64 0.0000761 13.2 6.79 15.2 Welch Two Sample t-test two.sided
11 2022-01-02 Gly values1 -2.33 3 5.33 -4.43 0.000594 13.8 -3.47 -1.20 Welch Two Sample t-test two.sided
12 2022-01-02 Gly values2 1 6 5 0.267 0.793 14.5 -7.00 9.00 Welch Two Sample t-test two.sided
Thanks in advance!
CodePudding user response:
You need to add a run_id within each date/assay/timing group so that you can match use that as a join criteria to avoid the duplication.
There's clue, when you say
I am trying to get only 36 rows: 2 dates, 3 assays, 2 tests, 6 values for each test (3 values by 2 columns)
You have a date column with 2 unique dates, an assay column with 3 unique assays, a test column with 2 unique tests.... you also need a column with the 3 unique values for your "3 values by 2 columns". I'll call the column run_id
.
I'll also skip the comp
data frame, and essentially do a self-join:
pivot2 = df %>%
group_by(Date, Assay, Timing) %>%
mutate(run_id = row_number()) %>%
ungroup() %>%
pivot_longer(starts_with("meas"), names_to = "test")
t_tests =
full_join(
filter(pivot2, Timing == "Start") %>% select(-Timing, initial = value),
filter(pivot2, Timing == "End") %>% select(-Timing, final = value),
by = c("Date", "Assay", "run_id", "test")
)
# # A tibble: 36 × 6
# Date Assay run_id test initial final
# <chr> <chr> <int> <chr> <dbl> <dbl>
# 1 2022-01-01 Gly 1 meas1 1 -1
# 2 2022-01-01 Gly 1 meas2 4 7
# 3 2022-01-01 Gly 2 meas1 0 1
# 4 2022-01-01 Gly 2 meas2 10 8
# 5 2022-01-01 Gly 3 meas1 8 5
# 6 2022-01-01 Gly 3 meas2 -16 4
# 7 2022-01-01 Asp 1 meas1 6 7
# 8 2022-01-01 Asp 1 meas2 28 -5
# 9 2022-01-01 Asp 2 meas1 4 6
# 10 2022-01-01 Asp 2 meas2 9 9
# # … with 26 more rows
# # ℹ Use `print(n = ...)` to see more rows
I use a full_join
so that even if one data/assay/timing combo has a different number of runs, everything will still be included.