Computing statistical analysis (Mann-Whitney) multiple times in a data frame

I have a data frame, call it DF, that looks like this:

Subject	Avg_Score	EOTM	FSO_1
Joseph	1.09	1.20	6.1
Joseph	0.89	1.90	6.8
Joseph	0.99	0.80	8.2
Joseph (B)	0.76	0.80	8.9
Joseph (B)	1.23	0.10	21.1
Joseph (B)	1.11	0.22	26.1
Susie	1.8	11.20	60.1
Susie	1.9	10.90	63.8
Susie	1.4	10.80	81.2
Susie (B)	1.1	10.80	84.9
Susie (B)	1.2	12.10	71.1
Susie (B)	1.4	11.22	76.1

I want to perform a Mann-Whitney test between each subject and that subject's baseline (the rows marked (B)) in each category. For example, run a Mann-Whitney test for Joseph versus Joseph (B) on Avg_Score, EOTM, and FSO_1 separately, so that I get 3 p-values for the direct comparison between the two. My end goal is a final data frame, DF2, like this:

Subject	Avg_Score	EOTM	FSO_1
Joseph	p-val	p-val	p-val
Susie	p-val	p-val	p-val

Here p-val is the resulting p-value from comparing the regular subject rows with that subject's (B) rows. For example, the Avg_Score p-val for Joseph comes from a Mann-Whitney test comparing Joseph's Avg_Score with Joseph (B)'s Avg_Score.

To do the Mann-Whitney test, I can use the wilcox.test function. But in a large data set that has more rows and columns than listed here, how could I do this, perhaps with a for loop if necessary? I would appreciate any help, thank you. Example data for the test is here.

Subject <- c("Joseph", "Joseph", "Joseph", " Joseph (B)", " Joseph (B)", " Joseph (B)", " Susie", "Susie", "Susie", "Susie (B)", "Susie (B)", "Susie (B)")
Avg_Score <- c(1.09, 0.89, 0.99, 0.76, 1.23, 1.11, 1.88, 1.9, 1.4, 1.1, 1.2, 1.4)
EOTM <- c(1.2, 1.9, 0.8, 0.8, 0.1, 0.22, 11.2, 10.9, 10.8, 10.8, 12.1, 11.22)
FS0_1 <- c(6.1, 6.8, 8.2, 8.9, 21.1, 26.1, 60.1, 63.8, 81.2, 84.9, 71.1, 76.1)
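# data.frame() combines the vectors column-wise; as.data.frame() would treat
# the extra vectors as its row.names/optional arguments instead of columns
DF <- data.frame(Subject, Avg_Score, EOTM, FS0_1)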
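
For reference, a single comparison of the kind described above could look like the sketch below. It copies Joseph's Avg_Score values from the vectors above; the names joseph, joseph_base, and p_single are made up for illustration.

```r
# Hypothetical variable names; values are Joseph's Avg_Score rows from above.
joseph      <- c(1.09, 0.89, 0.99)   # regular rows
joseph_base <- c(0.76, 1.23, 1.11)   # "(B)" rows

# Two-sample Wilcoxon rank-sum (Mann-Whitney) test. The result is an
# "htest" list, and the p-value is stored in its p.value element.
res <- wilcox.test(joseph, joseph_base)
p_single <- res$p.value
print(p_single)
```

Repeating this call by hand for every subject and every column is the part that needs automating.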

Answer:

Copy 'Subject' into a temporary column 'Sub'. Then strip the whitespace (\\s+) followed by ( and everything after it from 'Subject', so that it can serve as the grouping column. Loop across the numeric columns; within each group, take one subset where 'Sub' contains ( followed by 'B' (the baseline rows) and a second subset with the remaining rows, apply wilcox.test to the two subsets, and extract the p-value ($p.value).

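library(dplyr)
library(stringr)
DF %>%
   # keep the original labels so the baseline rows can still be identified
   mutate(Sub = Subject) %>% 
   # group by the bare name: drop " (B)" and any stray leading/trailing spaces
   group_by(Subject = trimws(str_remove(Subject, "\\s+\\(.*"))) %>% 
   # for each numeric column, test the baseline rows against the regular rows
   summarise(across(where(is.numeric), ~ 
    wilcox.test(.x[str_detect(Sub, "\\(B")], 
      .x[str_detect(Sub, "\\(B", negate = TRUE)])$p.value), .groups = "drop")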

-output

# A tibble: 2 × 4
  Subject Avg_Score  EOTM FS0_1
  <chr>       <dbl> <dbl> <dbl>
1 Joseph      0.7   0.121   0.1
2 Susie       0.121 0.507   0.4
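
To see what the grouping step does on its own, here is a small sketch using base R stand-ins (sub and grepl) for str_remove and str_detect. The shortened Subject vector and the names key and is_base are illustrative assumptions, not part of the code above.

```r
# One label of each kind from the question, including the stray leading spaces.
Subject <- c("Joseph", " Joseph (B)", " Susie", "Susie (B)")

# Drop whitespace + "(" + everything after it, then trim remaining spaces.
key <- trimws(sub("\\s+\\(.*", "", Subject))

# Flag the baseline rows: labels that contain "(B".
is_base <- grepl("\\(B", Subject)

print(data.frame(Subject, key, is_base))
```

Rows that share a key end up in the same group, and is_base decides which side of wilcox.test each row goes to.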

Base R, using the same logic as above:

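# group rows by the bare subject name; the custom 'whitespace' pattern strips
# a trailing " (B)" as well as stray spaces
by(DF, trimws(DF$Subject, whitespace = "\\s+\\(.*|\\s*"), 
  FUN = \(x) {
    i1 <- grepl("\\(B", x$Subject)   # TRUE for baseline rows
    # p-value for every column except Subject: baseline vs. regular rows
    sapply(x[-1], \(u) wilcox.test(u[i1], u[!i1])$p.value) 
  })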
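
by() returns a list-like object with one named vector of p-values per subject, not a data frame. If the DF2 layout from the question is wanted, one possible way to get there is to row-bind the pieces. This is a minimal sketch that rebuilds the question's data so it runs on its own; the names key, res, m, and DF2 are assumptions, and the key is built with sub() rather than the whitespace argument shown above.

```r
# Data as given in the question (leading spaces in some labels are kept).
Subject <- c("Joseph", "Joseph", "Joseph", " Joseph (B)", " Joseph (B)", " Joseph (B)",
             " Susie", "Susie", "Susie", "Susie (B)", "Susie (B)", "Susie (B)")
Avg_Score <- c(1.09, 0.89, 0.99, 0.76, 1.23, 1.11, 1.88, 1.9, 1.4, 1.1, 1.2, 1.4)
EOTM <- c(1.2, 1.9, 0.8, 0.8, 0.1, 0.22, 11.2, 10.9, 10.8, 10.8, 12.1, 11.22)
FS0_1 <- c(6.1, 6.8, 8.2, 8.9, 21.1, 26.1, 60.1, 63.8, 81.2, 84.9, 71.1, 76.1)
DF <- data.frame(Subject, Avg_Score, EOTM, FS0_1)

# Grouping key: the subject name without " (B)" or stray spaces.
key <- trimws(sub("\\s+\\(.*", "", DF$Subject))

# One named vector of p-values per subject. Columns with tied values make
# wilcox.test() warn and fall back to a normal approximation.
res <- by(DF, key, FUN = function(x) {
  i1 <- grepl("\\(B", x$Subject)
  sapply(x[-1], function(u) wilcox.test(u[i1], u[!i1])$p.value)
})

# Stack the per-subject vectors into a matrix, then into a data frame.
m <- do.call(rbind, res)
DF2 <- data.frame(Subject = rownames(m), m, row.names = NULL,
                  stringsAsFactors = FALSE)
print(DF2)
```

The dplyr version already returns a tibble in this shape, so this step is only needed for the base R route.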