Home > Software design >  R dplyr - How to do a t test against constant control data with summarize following groupby
R dplyr - How to do a t test against constant control data with summarize following groupby

Time:02-22

I have the following data frame (df)

df <- data.frame( TTMENT = c("Group_1","Group_1","Group_1","Group_2","Group_2","Group_2","Group_3","Group_3","Group_3","Group_4","Group_4","Group_4"),
      D8 = c(40.3,37.4,44.8,39.6,40.6,41.1,38.5,41.6,41.8,43.6,43.1,41.6),   
      D9 = c(41.1,36.9,44.1,39.6,41.2,42.1,39.2,41.6,41.8,42.0,42.7,40.8),
      D10 = c(41.5,37.1,44.4,39.4,40.8,41.2,39.9,41.5,42.4,42.4,42.7,41.9)) 

# A tibble: 12 x 4
   TTMENT     D8    D9   D10
   <chr>   <dbl> <dbl> <dbl>
 1 Group_1  40.3  41.1  41.5
 2 Group_1  37.4  36.9  37.1
 3 Group_1  44.8  44.1  44.4
 4 Group_2  39.6  39.6  39.4
 5 Group_2  40.6  41.2  40.8
 6 Group_2  41.1  42.1  41.2
 7 Group_3  38.5  39.2  39.9
 8 Group_3  41.6  41.6  41.5
 9 Group_3  41.8  41.8  42.4
10 Group_4  43.6  42    42.4
11 Group_4  43.1  42.7  42.7
12 Group_4  41.6  40.8  41.9

Group 1 is control data against which I need to compare (via t-test) data from groups 2, 3 and 4. I'm trying to use dplyr (groupby / summarize) to show mean and t-test p-value, and the mean should have an "*" if p<0.05. The intended output is as follows:

               D8     D9     D10
Group_1 mean  XX.X   XX.X   XX.X
Group_2 mean  XX.X   XX.X   *XX.X
Group_2 p     0.XX   0.XX   0.03  
Group_3 mean  XX.X   XX.X   XX.X
Group_3 p     0.XX   0.XX   0.XX  
Group_4 mean  XX.X   *XX.X   XX.X
Group_4 p     0.XX   0.01   0.XX 

I'm trying to achieve this with the following code:

# Save control data in a separate data frame 
control <- df%>%
   filter(TTMENT == "Group_1") %>%
   select(-TTMENT)

# function to do the actual job
compare_means <- function(x) {

   p <- t.test(control,x)$p.value   # this bit is not doing what is intended

   data_mean <- sprintf("%.1f", mean(x, na.rm=TRUE))
   if (p < 0.05) {
      significant = "*" } else { significant = "" }
   data_mean <- paste(significant, data_mean, sep="")

   # Next line also triggers error. It is not possible to return two lines to summarize?
   result <- data.frame(c(data_mean, p), row.names("mean", "p"), stringAsFactors=FALSE)
   
   return(result)
}

# dplyr part    
stack %>%
  group_by(TTMENT) %>%
  summarize(across(1:3, ~ compare_means(.)  ))

I don't know how to handle the t-test comparisons (each not Group_1 against Group_1 [control])

Furthermore, is it possible to achieve what I intend to in one go? or should I do a dplyr groupby/summrize step for each row of results I need, i.e. one for p values and another one for means, and possibly another one for any other statistic I might need?

CodePudding user response:

I suggest the following approach.

It doesn‘t give you the very exact structure of your desired output, but a clean / tidy version of all the t test results. From there it should be easy to e.g. add asterisks for significant result.

Not sure what you want to do with your output, but IMO I would skip the last lines of code where I pivot_longer and pivot_wider and just use the tidy output as is because that‘s the, well, most tidy output you can get.

library(tidyverse)
library(broom)

df_control <- df %>%
  filter(TTMENT == 'Group_1') %>%
  pivot_longer(cols = -TTMENT)

df_control

df %>%
  filter(TTMENT != 'Group_1') %>%
  pivot_longer(cols = -TTMENT) %>%
  mutate(name_filter = name) %>%
  group_by(TTMENT, name) %>%
  group_modify(.f = ~tidy(t.test(df_control$value[df_control$name == .x$name_filter], .x$value, alternative = 'two.sided'))) %>%
  arrange(TTMENT, match(name, str_sort(name, numeric = TRUE))) %>%
  select(TTMENT, name, group_1_mean = estimate1, group_estimate = estimate2, p.value) %>%
  pivot_longer(-c(TTMENT, name), names_to = 'statistic') %>%
  pivot_wider()

which gives:

# A tibble: 9 x 5
# Groups:   TTMENT [3]
  TTMENT  statistic          D8     D9    D10
  <chr>   <chr>           <dbl>  <dbl>  <dbl>
1 Group_2 group_1_mean   40.8   40.7   41    
2 Group_2 group_estimate 40.4   41.0   40.5  
3 Group_2 p.value         0.871  0.913  0.828
4 Group_3 group_1_mean   40.8   40.7   41    
5 Group_3 group_estimate 40.6   40.9   41.3  
6 Group_3 p.value         0.939  0.946  0.914
7 Group_4 group_1_mean   40.8   40.7   41    
8 Group_4 group_estimate 42.8   41.8   42.3  
9 Group_4 p.value         0.467  0.646  0.595

CodePudding user response:

Thanks to @deschen for his answer.

Just posting the complete working code (based on @deschen suggestion) to obtain the intended summary output. I'm certain there's much leaner solutions, but I think this (newbie) code is easy to understand and probably helpful to others.

(df_control <- df %>%
  filter(TTMENT == 'Group_1') %>%
  pivot_longer(cols = -TTMENT)
)

get_summary_statistics <- function(x) {

  control <- df_control$value[df_control$name == x$name_filter]
  
  # t.test and assigns p-value
  p_value <- t.test(control, x$value)$p.value
  
  # Compose mean  /- SD (with asterisk if significant)
  significant <- case_when(p_value < 0.01 ~"**",
                           p_value < 0.05 ~"*",
                           p_value >= 0.05 ~"")
   
  mean <- mean(x$value, na.rm=TRUE)
  sd <- sd(x$value, na.rm=TRUE)
  a <- paste (significant, sprintf("%.1f", mean), sep="")
  b <- "\u00B1"
  c <- sprintf("%.1f", sd(x$value, na.rm=TRUE))
  mean_and_sd <- paste(a, b, c, sep=" ")
  
  # SD over mean as %
  sd_over_mean = sprintf("%.1f", sd/mean*100)
  
  # N (excluding NAs)
  n = sprintf("%d", sum(!is.na(x$value)) )
  
  
  # Returns a tibble with the required statistics
  tibble(p = sprintf("%.3f", p_value),
         "Mean \u00B1 SD" = mean_and_sd,
         "SD (% of Mean)" = sd_over_mean,
         "n" = n)

}


df %>%
  pivot_longer(cols = -TTMENT) %>%  # columns "name" and "value" are defaults from pivot_longer
  mutate(name_filter = name) %>%    # name_filter is needed to get the necessary values from control data
  group_by(TTMENT, name) %>%
  group_modify(.f = ~get_summary_statistics(.x)) %>%
  pivot_longer(-c(TTMENT, name), names_to = 'Statistic') %>%
  pivot_wider() %>%
  arrange(TTMENT, match(Statistic, c("n", "Mean ± SD","SD (% of Mean)","p")))

Output:

# A tibble: 16 x 5
# Groups:   TTMENT [4]
   TTMENT  Statistic      D10        D8         D9        
   <chr>   <chr>          <chr>      <chr>      <chr>     
 1 Group_1 n              3          3          3         
 2 Group_1 Mean ± SD      41.0 ± 3.7 40.8 ± 3.7 40.7 ± 3.6
 3 Group_1 SD (% of Mean) 9.0        9.1        8.9       
 4 Group_1 p              1.000      1.000      1.000     
 5 Group_2 n              3          3          3         
 6 Group_2 Mean ± SD      40.5 ± 0.9 40.4 ± 0.8 41.0 ± 1.3
 7 Group_2 SD (% of Mean) 2.3        1.9        3.1       
 8 Group_2 p              0.828      0.871      0.913     
 9 Group_3 n              3          3          3         
10 Group_3 Mean ± SD      41.3 ± 1.3 40.6 ± 1.9 40.9 ± 1.4
etc.
  • Related