For loop in R to run Chi_square test on all columns-CodePudding

head(Data)
                       
          Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 Gene7 Gene8 
  SAMPLE1   0    1      0      0     0     1   1     1          
  SAMPLE2   0    1      0      1     0     1   0     1        
  SAMPLE3   0    0      0      1     0     0   1     1          
  SAMPLE4   1    0      0      0     1     0   0     0

I am running fisher test and chi square test like following

Gene7_Gene8<- table (Data[, c ("Gene7", "Gene8")])
chisq.test( Gene7_Gene8) and 
fisher.test(Gene7_Gene8)

How can i run the chisq.test by for loop in all the data sets at once. I am looking for p.value against each possible pair of gene in my data set. (my original data has around 60 columns)

CodePudding user response：

I hope this works out for you.

# This is a fake dataset similar to the one in the Opening post:
df <- data.frame(
  
  gene1 = sample(c(0,1),10,T),
  gene2 = sample(c(0,1),10,T),
  gene3 = sample(c(0,1),10,T),
  gene4 = sample(c(0,1),10,T),
  gene5 = sample(c(0,1),10,T)
)
rownames(df) <- paste0("sample",1:nrow(df))



# ldf stands for long df, and is a longer version, which will make it easier to analysie.
ldf <- df %>% pivot_longer(cols = everything(),names_to = "gene_type",
                           values_to = "true_false") %>%
  arrange(gene_type)
# ndf stands for nested df, and is the already nested-by-genes data frame we will analyse
ndf <- ldf %>% nest_by(gene_type)

# Here we create a new colum named p_value and in the map() function
# we extract the p values from a chiq.test erformed on each gene
ndf$p_value <- map(ndf$data,~chisq.test(table(.x))$p.value) %>% unlist()


# Now we have our final data frame and can further anlyse and create plots.
final_df <- ndf %>% unnest(cols = data)
final_df %>% group_by(gene_type) %>% summarise(p.value=mean(p_value)) %>% 
  mutate(is_significant=p.value<0.05)

# In my personal opinion, leaving the data frame in a long version is best suited for further anlysis.

final_df %>% group_by(gene_type) %>% summarise(p.value=mean(p_value)) %>% 
  mutate(is_significant=p.value<0.05) %>% 
  ggplot(aes(x=gene_type,y=p.value,color=is_significant)) 
  geom_text(aes(label=gene_type),show.legend = F) 
  labs(title = "P values of different genes") 
  theme_minimal()

CodePudding user response：

We may use combn

lst1 <- combn(seq_len(ncol(Data)), 2, function(i) 
   FUN = chisq.test(table(Data[, i])), simplify = FALSE)
names(lst1) <- combn(colnames(Data), 2, FUN = paste, collapse = "_")

CodePudding user response：

No for loop is required, we can do them as such:

com <- data.frame(t(combn(colnames(Data), 2)))
com
#       X1    X2
# 1  Gene1 Gene2
# 2  Gene1 Gene3
# 3  Gene1 Gene4
# 4  Gene1 Gene5
# 5  Gene1 Gene6
# 6  Gene1 Gene7
# 7  Gene1 Gene8
# 8  Gene2 Gene3
# ...

chisq <- apply(com, 1, function(i) chisq.test(table(Data[,i]))$p.value)
# Warning in chisq.test(table(Data[, i])) :
#   Chi-squared approximation may be incorrect
# Warning in chisq.test(table(Data[, i])) :
#   Chi-squared approximation may be incorrect
# Warning in chisq.test(table(Data[, i])) :
#   Chi-squared approximation may be incorrect
# ...

chisq
#  [1] 1.0000000 0.3173105 1.0000000 0.5049851 1.0000000 1.0000000 0.5049851 1.0000000 1.0000000 1.0000000 0.3173105
# [12] 1.0000000 1.0000000 1.0000000 0.3173105 1.0000000 1.0000000 0.3173105 1.0000000 1.0000000 1.0000000 1.0000000
# [23] 1.0000000 1.0000000 0.5049851 1.0000000 1.0000000 1.0000000

For the fisher test, we run into some problems:

fisher <- apply(com, 1, function(i) fisher.test(table(Data[,i]))$p.value)
# Error in fisher.test(table(Data[, i])) : 
#   'x' must have at least 2 rows and columns

This is because not all tables are 2x2, so we need to catch that error and return something meaningful. I suggest NA:

fisher <- apply(com, 1, function(i) tryCatch(fisher.test(table(Data[,i]))$p.value, error = function(e) NA_real_))
fisher
#  [1] 1.0000000        NA 1.0000000 0.2500000 1.0000000 1.0000000 0.2500000        NA 1.0000000 1.0000000 0.3333333
# [12] 1.0000000 1.0000000        NA        NA        NA        NA        NA 1.0000000 1.0000000 1.0000000 1.0000000
# [23] 1.0000000 1.0000000 0.2500000 1.0000000 1.0000000 1.0000000

From here, we can easily combine them:

out <- cbind(com, chisq=chisq, fisher=fisher)
out
#       X1    X2     chisq    fisher
# 1  Gene1 Gene2 1.0000000 1.0000000
# 2  Gene1 Gene3 0.3173105        NA
# 3  Gene1 Gene4 1.0000000 1.0000000
# 4  Gene1 Gene5 0.5049851 0.2500000
# 5  Gene1 Gene6 1.0000000 1.0000000
# 6  Gene1 Gene7 1.0000000 1.0000000
# 7  Gene1 Gene8 0.5049851 0.2500000
# 8  Gene2 Gene3 1.0000000        NA
# 9  Gene2 Gene4 1.0000000 1.0000000
# ...

While combn does have the FUN= argument, I found it easier to reuse the results for the fisher.test automation as well. It also clears up the output (in my mind) since it clearly shows which genes are considered in each comparison.