If/For loop: If shapiro-Wilk is >0.05, perform one-way ANOVA, if <0.05 perform Kruskal-wallis-CodePudding

I would like to perform a loop in r, where if the p value of a shaprio-wilk test is less than 0.05, It performs a kruskall-wallis test, but if it is greater than 0.05, a one-way ANOVA is performed. Extending this, if the result of the kruskall-wallis or one-way ANOVA p, value i s < 0.05, a post-hoc test is performed. Would this be possible in r?

#Example columns
Label <- c("blue", "red", "green", "blue", "red", "green","blue", "red", "green","blue", "red", "green","blue", "red", "green")
n <- c(10, 223, 12890, 34, 78, 1902, 34, 211, 1007,209, 330, 90446, 801, 1029, 9011)

#make example data frame
data <- data.frame(Label,n)

#shapiro-wilk test for normality
shapiro.test(data$n)

#if p-value for shapiro-wilk is >0.05, perform one-way ANOVA
OneWay <- lm(n ~ Label, data = data) 
anova(OneWay)
#if one-way ANOVA is < 0.05, perform post-hoc
library(mosaic)
TukeyHSD(OneWay)

#If shapiro-wilk < 0.05, perform Krushkall-wallis
kruskal <- kruskal.test(data$n, data$Label) #dependent variable followed by the category (predictor) variable
#if krushkall-wallis is < 0.05, perform post-hoc
library(mosaic)
TukeyHSD(kruskal)

CodePudding user response：

One approach

load libraries and prepare sample data:

library(dplyr)
library(mosaic)
library(tidyr)

df <- data.frame(Label = c("blue", "red", "green", "blue", "red", "green","blue", "red",
                           "green","blue", "red", "green","blue", "red", "green"),
                 n = c(10, 223, 12890, 34, 78, 1902, 34, 211, 1007,209, 330,
                       90446, 801, 1029, 9011)
                 )

The following exploits the advantage that a dataframe's "cells" cannot only hold atomic values but also lists (which makes the respective column a "list column"). You can thus also store, e.g. an linear model's output in a cell, if you wrap it in a list. Take care, though, to extract the list item before mutateing it (hence the various [[1]] in below code.

df |>
  summarise(is_normal = shapiro.test(n)[["p.value"]] > .5,
            lin_mod = list(lm(n ~ Label)),
            p_between_groups = ifelse(is_normal,
                                      anova(lin_mod[[1]])[["Pr(>F)"]][1],
                                      kruskal.test(n ~  Label)[["p.value"]]                                      
                                      )
            ) |>
  mutate(test_type = ifelse(is_normal, "anova", "kruskal"),
         p_tukey = ifelse(is_normal & test_type == "anova",
                          list(TukeyHSD(lin_mod[[1]])[["Label"]][, "p adj"]),
                          NA
                          )
         ) |>
  unnest_longer(p_tukey)

output:

  # A tibble: 3 x 6
    p_shapiro lin_mod p_between_groups test_between p_tukey p_tukey_id
        <dbl> <list>             <dbl> <chr>          <dbl> <chr>     
1 0.000000459 <lm>             0.00695 kruskal        0.265 green-blue
2 0.000000459 <lm>             0.00695 kruskal        1.00  red-blue  
3 0.000000459 <lm>             0.00695 kruskal        0.270 red-green

(note that I forced Tukey's test which wouldn't be triggered by your sample data)

CodePudding user response：

Here you go:

We can do it with a simple if else statetment: The challenge is to access the p-value of the different tests (all are shown in the code (for me the overall p-value of lm was the most challenging).

Note: You are asking for a TukeyHSD test if krushkall-wallis is < 0.05 in your code. In this case TukeyHSD is not appropriate as it is performed after ANOVA. After kruskal wallis test there are other posthoc tests, one of is the Bonferroni (= Dunn Procedure) which can be used in ANOVA and Kruskal Wallis in equally sample sizes.

#Example columns
Label <- c("blue", "red", "green", "blue", "red", "green","blue", "red", "green","blue", "red", "green","blue", "red", "green")
n <- c(10, 223, 12890, 34, 78, 1902, 34, 211, 1007,209, 330, 90446, 801, 1029, 9011)

data <- data.frame(Label,n)


# perform shapiro wilk test
shap <- shapiro.test(data$n)

# 1. if p-value for shapiro-wilk is > 0.05, perform one-way ANOVA

if(shap$p.value > 0.05){
  OneWay <- lm(n ~ Label, data = data) 
  anova(OneWay)
}

###############################################################

# 2. if one-way ANOVA is < 0.05, perform post-hoc

# get overall p-value of model
f <- summary(OneWay)$fstatistic
oneway.pvalue <- unname(pf(f[1],f[2],f[3],lower.tail=F))
oneway.pvalue

# [1] 0.2083276

if(oneway.pvalue < 0.05){
  TukeyHSD(OneWay)
}

###############################################################

# 3. If shapiro-wilk < 0.05, perform Krushkall-wallis

if(shap$p.value < 0.05){
  kt <- kruskal.test(n ~ Label, data = data )
  kt
} 

# Kruskal-Wallis rank sum test

# data:  n by Label
# Kruskal-Wallis chi-squared = 9.9377, df = 2, p-value = 0.006951

###############################################################

# 4. if krushkall-wallis is < 0.05, perform post-hoc
if(kt$p.value < 0.05){
  #install.packages("FSA")
  library(FSA)
  DT = dunnTest(n ~ Label,
                data=data,
                method="bh")
  DT
}

# DT
# Dunn (1964) Kruskal-Wallis multiple comparison
# p-values adjusted with the Benjamini-Hochberg method.

# Comparison         Z     P.unadj       P.adj
# 1 blue - green -3.114051 0.001845373 0.005536119
# 2   blue - red -1.132382 0.257473719 0.257473719
# 3  green - red  1.981669 0.047516285 0.071274427