head(Data)
Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 Gene7 Gene8
SAMPLE1 0 1 0 0 0 1 1 1
SAMPLE2 0 1 0 1 0 1 0 1
SAMPLE3 0 0 0 1 0 0 1 1
SAMPLE4 1 0 0 0 1 0 0 0
I am running fisher test and chi square test like following
Gene7_Gene8<- table (Data[, c ("Gene7", "Gene8")])
chisq.test( Gene7_Gene8) and
fisher.test(Gene7_Gene8)
How can i run the chisq.test by for loop
in all the data sets at once. I am looking for p.value against each possible pair of gene in my data set. (my original data has around 60 columns)
CodePudding user response:
I hope this works out for you.
# This is a fake dataset similar to the one in the Opening post:
df <- data.frame(
gene1 = sample(c(0,1),10,T),
gene2 = sample(c(0,1),10,T),
gene3 = sample(c(0,1),10,T),
gene4 = sample(c(0,1),10,T),
gene5 = sample(c(0,1),10,T)
)
rownames(df) <- paste0("sample",1:nrow(df))
# ldf stands for long df, and is a longer version, which will make it easier to analysie.
ldf <- df %>% pivot_longer(cols = everything(),names_to = "gene_type",
values_to = "true_false") %>%
arrange(gene_type)
# ndf stands for nested df, and is the already nested-by-genes data frame we will analyse
ndf <- ldf %>% nest_by(gene_type)
# Here we create a new colum named p_value and in the map() function
# we extract the p values from a chiq.test erformed on each gene
ndf$p_value <- map(ndf$data,~chisq.test(table(.x))$p.value) %>% unlist()
# Now we have our final data frame and can further anlyse and create plots.
final_df <- ndf %>% unnest(cols = data)
final_df %>% group_by(gene_type) %>% summarise(p.value=mean(p_value)) %>%
mutate(is_significant=p.value<0.05)
# In my personal opinion, leaving the data frame in a long version is best suited for further anlysis.
final_df %>% group_by(gene_type) %>% summarise(p.value=mean(p_value)) %>%
mutate(is_significant=p.value<0.05) %>%
ggplot(aes(x=gene_type,y=p.value,color=is_significant))
geom_text(aes(label=gene_type),show.legend = F)
labs(title = "P values of different genes")
theme_minimal()
CodePudding user response:
We may use combn
lst1 <- combn(seq_len(ncol(Data)), 2, function(i)
FUN = chisq.test(table(Data[, i])), simplify = FALSE)
names(lst1) <- combn(colnames(Data), 2, FUN = paste, collapse = "_")
CodePudding user response:
No for
loop is required, we can do them as such:
com <- data.frame(t(combn(colnames(Data), 2)))
com
# X1 X2
# 1 Gene1 Gene2
# 2 Gene1 Gene3
# 3 Gene1 Gene4
# 4 Gene1 Gene5
# 5 Gene1 Gene6
# 6 Gene1 Gene7
# 7 Gene1 Gene8
# 8 Gene2 Gene3
# ...
chisq <- apply(com, 1, function(i) chisq.test(table(Data[,i]))$p.value)
# Warning in chisq.test(table(Data[, i])) :
# Chi-squared approximation may be incorrect
# Warning in chisq.test(table(Data[, i])) :
# Chi-squared approximation may be incorrect
# Warning in chisq.test(table(Data[, i])) :
# Chi-squared approximation may be incorrect
# ...
chisq
# [1] 1.0000000 0.3173105 1.0000000 0.5049851 1.0000000 1.0000000 0.5049851 1.0000000 1.0000000 1.0000000 0.3173105
# [12] 1.0000000 1.0000000 1.0000000 0.3173105 1.0000000 1.0000000 0.3173105 1.0000000 1.0000000 1.0000000 1.0000000
# [23] 1.0000000 1.0000000 0.5049851 1.0000000 1.0000000 1.0000000
For the fisher test, we run into some problems:
fisher <- apply(com, 1, function(i) fisher.test(table(Data[,i]))$p.value)
# Error in fisher.test(table(Data[, i])) :
# 'x' must have at least 2 rows and columns
This is because not all tables are 2x2, so we need to catch that error and return something meaningful. I suggest NA
:
fisher <- apply(com, 1, function(i) tryCatch(fisher.test(table(Data[,i]))$p.value, error = function(e) NA_real_))
fisher
# [1] 1.0000000 NA 1.0000000 0.2500000 1.0000000 1.0000000 0.2500000 NA 1.0000000 1.0000000 0.3333333
# [12] 1.0000000 1.0000000 NA NA NA NA NA 1.0000000 1.0000000 1.0000000 1.0000000
# [23] 1.0000000 1.0000000 0.2500000 1.0000000 1.0000000 1.0000000
From here, we can easily combine them:
out <- cbind(com, chisq=chisq, fisher=fisher)
out
# X1 X2 chisq fisher
# 1 Gene1 Gene2 1.0000000 1.0000000
# 2 Gene1 Gene3 0.3173105 NA
# 3 Gene1 Gene4 1.0000000 1.0000000
# 4 Gene1 Gene5 0.5049851 0.2500000
# 5 Gene1 Gene6 1.0000000 1.0000000
# 6 Gene1 Gene7 1.0000000 1.0000000
# 7 Gene1 Gene8 0.5049851 0.2500000
# 8 Gene2 Gene3 1.0000000 NA
# 9 Gene2 Gene4 1.0000000 1.0000000
# ...
While combn
does have the FUN=
argument, I found it easier to reuse the results for the fisher.test
automation as well. It also clears up the output (in my mind) since it clearly shows which genes are considered in each comparison.