I have a data frame that looks like this; call it DF.
Subject | Avg_Score | EOTM | FSO_1 |
---|---|---|---|
Joseph | 1.09 | 1.20 | 6.1 |
Joseph | 0.89 | 1.90 | 6.8 |
Joseph | 0.99 | 0.80 | 8.2 |
Joseph (B) | 0.76 | 0.80 | 8.9 |
Joseph (B) | 1.23 | 0.10 | 21.1 |
Joseph (B) | 1.11 | 0.22 | 26.1 |
Susie | 1.8 | 11.20 | 60.1 |
Susie | 1.9 | 10.90 | 63.8 |
Susie | 1.4 | 10.80 | 81.2 |
Susie (B) | 1.1 | 10.80 | 84.9 |
Susie (B) | 1.2 | 12.10 | 71.1 |
Susie (B) | 1.4 | 11.22 | 76.1 |
I want to perform a Mann-Whitney test between each subject and that subject's baseline (the rows marked "(B)") in each category. For example, run a Mann-Whitney test for Joseph vs. Joseph (B) on Avg_Score, EOTM, and FSO_1 separately, so I get 3 p-values for the direct comparison between the two. My end goal is a final data frame, DF2, like this:
Subject | Avg_Score | EOTM | FSO_1 |
---|---|---|---|
Joseph | p-val | p-val | p-val |
Susie | p-val | p-val | p-val |
Here, p-val is the p-value from comparing the regular subject rows with the matching subject (B) rows. E.g. the p-val for Avg_Score in the Joseph row comes from a Mann-Whitney test comparing Joseph's Avg_Score vs. Joseph (B)'s Avg_Score.
To do the Mann-Whitney test, I can use the wilcox.test command. But in a large data set that has more rows/columns than listed here, how could I do this, perhaps with a for loop if necessary? I would appreciate any help, thank you. Example data for the Wilcoxon test is here.
Subject <- c("Joseph", "Joseph", "Joseph", " Joseph (B)", " Joseph (B)", " Joseph (B)", " Susie", "Susie", "Susie", "Susie (B)", "Susie (B)", "Susie (B)")
Avg_Score <- c(1.09, 0.89, 0.99, 0.76, 1.23, 1.11, 1.88, 1.9, 1.4, 1.1, 1.2, 1.4)
EOTM <- c(1.2, 1.9, 0.8, 0.8, 0.1, 0.22, 11.2, 10.9, 10.8, 10.8, 12.1, 11.22)
FS0_1 <- c(6.1, 6.8, 8.2, 8.9, 21.1, 26.1, 60.1, 63.8, 81.2, 84.9, 71.1, 76.1)
# data.frame() builds one column per vector; as.data.frame() would not combine them
DF <- data.frame(Subject, Avg_Score, EOTM, FS0_1)
CodePudding user response:
Create a temporary column 'Sub' from 'Subject', then remove the spaces (`\\s`) followed by the `(` and any characters from 'Subject' to use it as the grouping column. Loop `across` the numeric columns, subset the elements where 'Sub' doesn't have `(` followed by 'B' and the second subset with the rest, apply `wilcox.test`, and extract the p-value (`$p.value`).
library(dplyr)
library(stringr)
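DF %>%
  # keep the original labels so baseline rows can still be identified
  mutate(Sub = Subject) %>%
  # strip the " (B)" suffix and stray whitespace to get the grouping name
  group_by(Subject = trimws(str_remove(Subject, "\\s+\\(.*"))) %>%
  # per group and numeric column: baseline rows vs. the remaining rows
  summarise(across(where(is.numeric), ~
    wilcox.test(.x[str_detect(Sub, "\\(B")],
                .x[str_detect(Sub, "\\(B", negate = TRUE)])$p.value), .groups = "drop")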
-output
# A tibble: 2 × 4
Subject Avg_Score EOTM FS0_1
<chr> <dbl> <dbl> <dbl>
1 Joseph 0.7 0.121 0.1
2 Susie 0.121 0.507 0.4
base R
- similar logic as stated above
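by(DF, trimws(DF$Subject, whitespace = "\\s+\\(.*|\\s*"),
   FUN = \(x) {
     # TRUE for the baseline "(B)" rows within this subject
     i1 <- grepl("\\(B", x$Subject)
     # one p-value per numeric column (everything except Subject)
     sapply(x[-1], \(u) wilcox.test(u[i1], u[!i1])$p.value)
   })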
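Since the question asks about a for loop, here is a minimal base R sketch of the same logic written as explicit loops that fill the DF2 table directly. The helper names `is_base`, `name`, `subjects`, and `measures` are my own, and it assumes every baseline row is marked with "(B)" in 'Subject'. `suppressWarnings()` only hides the message that `wilcox.test` emits when tied values force a normal approximation.

```r
# Example data copied from the question (stray leading spaces kept as posted).
DF <- data.frame(
  Subject   = c("Joseph", "Joseph", "Joseph", " Joseph (B)", " Joseph (B)", " Joseph (B)",
                " Susie", "Susie", "Susie", "Susie (B)", "Susie (B)", "Susie (B)"),
  Avg_Score = c(1.09, 0.89, 0.99, 0.76, 1.23, 1.11, 1.88, 1.9, 1.4, 1.1, 1.2, 1.4),
  EOTM      = c(1.2, 1.9, 0.8, 0.8, 0.1, 0.22, 11.2, 10.9, 10.8, 10.8, 12.1, 11.22),
  FS0_1     = c(6.1, 6.8, 8.2, 8.9, 21.1, 26.1, 60.1, 63.8, 81.2, 84.9, 71.1, 76.1)
)

is_base  <- grepl("\\(B", DF$Subject)                  # TRUE for baseline rows
name     <- trimws(sub("\\s*\\(.*$", "", DF$Subject))  # subject name without "(B)"
subjects <- unique(name)
measures <- setdiff(names(DF), "Subject")              # every column except Subject

# Pre-allocate the result: one row per subject, one column per measure.
DF2 <- data.frame(Subject = subjects)
DF2[measures] <- NA_real_

for (s in subjects) {
  for (m in measures) {
    regular  <- DF[[m]][name == s & !is_base]
    baseline <- DF[[m]][name == s & is_base]
    # Two-sided Mann-Whitney test; ties trigger a warning, silenced here.
    DF2[DF2$Subject == s, m] <-
      suppressWarnings(wilcox.test(regular, baseline)$p.value)
  }
}

DF2
```

On the example data this should give the same p-values as the dplyr output above, just as a plain data frame rather than a tibble. The loop version is more verbose, but it makes it easy to add per-subject checks, for instance skipping a subject that has no baseline rows.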