I have a dataframe that follows this format:
df = pd.DataFrame({'subtype': ['AC', 'SCC', 'SCC', 'AC', 'AC', 'SCC', 'AC'],
'geneA': ['0.56', '0.74', '0.89', '0.99', '0.24', '0.76', '0.60'],
'geneB': ['0.54', '0.73', '0.82', '0.99', '0.23', '0.74', '0.61'],
'geneC': ['0.53', '0.72', '0.84', '0.97', '0.23', '0.76', '0.62'],
'geneD': ['0.52', '0.77', '0.89', '0.99', '0.23', '0.75', '0.64'],
'geneE': ['0.51', '0.77', '0.89', '0.93', '0.23', '0.76', '0.64'],
'geneF': ['0.50', '0.79', '0.89', '0.96', '0.26', '0.73', '0.65'],
'geneG': ['0.56', '0.78', '0.89', '0.99', '0.23', '0.76', '0.64']})
It is much larger (it has about 1000 genes, i.e., columns). Each number corresponds to an mRNA abundance value.
I need to compare the AC and SCC subtypes for each gene using the Wilcoxon rank sum test. I need to do this for every gene in my dataset, so I essentially need to run the test about 1000 times. For each gene, group1 is the mRNA values for the AC subtype and group2 is the mRNA values for the SCC subtype:
from scipy.stats import ranksums
ranksums(group1, group2)
I need to create a for loop that runs the rank sum test between the two subtypes (AC and SCC) for each gene and collects the results in a list of p-values, one per gene (each column is a gene, so about 1000 in total).
How can I achieve this in Python? This is what I have tried, with no luck:
p_vals = []
for i in range(1000):
    new_data = subset.copy()
    permuted_labels = list(subset['subtype'].sample(n=subset.shape[0], replace=False))
    new_data['subtype'] = permuted_labels
    group1 = new_data.loc[new_data.subtype == 'AC']
    group2 = new_data.loc[new_data.subtype == 'SCC']
    ranksums = ranksums(group1, group2)
    p_vals.append(ranksums)
print(p_vals)
I also need to do something similar for fold change: instead of a p-value, I need to calculate the fold change (FC) of mean mRNA abundance between the AC and SCC subtypes for every gene (using the AC value in the numerator of FC). I then need to combine the gene FC and the p-values from the rank sum test into a single table, and add a column to this table for the corrected p-values using:
from statsmodels.stats.multitest import fdrcorrection
fdrcorrection(list_of_pvalues, alpha=0.05, method='indep', is_sorted=False)
Answer:
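I think I have a working solution. It returns the exact same p-value for every gene, which looks like a property of the data you provided: the rank sum test only uses the ranks of the values, and in every gene column of the sample the AC samples take ranks 1, 2, 3 and 7 while the SCC samples take ranks 4, 5 and 6.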
import pandas as pd
from scipy.stats import ranksums
df = pd.DataFrame({'subtype': ['AC', 'SCC', 'SCC', 'AC', 'AC', 'SCC', 'AC'],
'geneA': ['0.56', '0.74', '0.89', '0.99', '0.24', '0.76', '0.60'],
'geneB': ['0.54', '0.73', '0.82', '0.99', '0.23', '0.74', '0.61'],
'geneC': ['0.53', '0.72', '0.84', '0.97', '0.23', '0.76', '0.62'],
'geneD': ['0.52', '0.77', '0.89', '0.99', '0.23', '0.75', '0.64'],
'geneE': ['0.51', '0.77', '0.89', '0.93', '0.23', '0.76', '0.64'],
'geneF': ['0.50', '0.79', '0.89', '0.96', '0.26', '0.73', '0.65'],
'geneG': ['0.56', '0.78', '0.89', '0.99', '0.23', '0.76', '0.64']})
def geneRankSum(df, geneColumnName):
    # function to return the rank sum p-value for a given gene
    ac = df[df['subtype'] == 'AC']
    scc = df[df['subtype'] == 'SCC']
    # the example values are stored as strings, so convert to float before ranking
    acGene = ac[geneColumnName].astype(float)
    sccGene = scc[geneColumnName].astype(float)
    return ranksums(acGene, sccGene).pvalue
genes = list(df.columns)  # list of genes from df columns
genes.remove('subtype')  # removes "subtype" from the list
pvalues = []  # list of p-values to fill
for gene in genes:  # loops through the list of genes
    pvalues.append(geneRankSum(df, gene))  # adds the p-value for this gene to the list
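For the second half of the question (fold change, a combined table, and corrected p-values), here is a rough sketch of how it could be put together. It assumes the gene columns can be cast to float, and that FC is simply mean(AC) / mean(SCC). The names `results`, `fold_change`, `p_value`, `p_adjusted` and `significant` are my own choices, not anything required by scipy or statsmodels.

```python
import pandas as pd
from scipy.stats import ranksums
from statsmodels.stats.multitest import fdrcorrection

# sample data from the question (values are stored as strings)
df = pd.DataFrame({'subtype': ['AC', 'SCC', 'SCC', 'AC', 'AC', 'SCC', 'AC'],
                   'geneA': ['0.56', '0.74', '0.89', '0.99', '0.24', '0.76', '0.60'],
                   'geneB': ['0.54', '0.73', '0.82', '0.99', '0.23', '0.74', '0.61'],
                   'geneC': ['0.53', '0.72', '0.84', '0.97', '0.23', '0.76', '0.62'],
                   'geneD': ['0.52', '0.77', '0.89', '0.99', '0.23', '0.75', '0.64'],
                   'geneE': ['0.51', '0.77', '0.89', '0.93', '0.23', '0.76', '0.64'],
                   'geneF': ['0.50', '0.79', '0.89', '0.96', '0.26', '0.73', '0.65'],
                   'geneG': ['0.56', '0.78', '0.89', '0.99', '0.23', '0.76', '0.64']})

# every column except 'subtype' is treated as a gene (assumption)
genes = [col for col in df.columns if col != 'subtype']
values = df[genes].astype(float)  # convert the string values to numbers

is_ac = df['subtype'] == 'AC'
is_scc = df['subtype'] == 'SCC'

rows = []
for gene in genes:
    ac = values.loc[is_ac, gene]
    scc = values.loc[is_scc, gene]
    rows.append({
        'gene': gene,
        'fold_change': ac.mean() / scc.mean(),  # AC mean in the numerator
        'p_value': ranksums(ac, scc).pvalue,
    })

results = pd.DataFrame(rows).set_index('gene')

# fdrcorrection returns a tuple: (rejected flags, corrected p-values)
rejected, p_adjusted = fdrcorrection(results['p_value'].to_numpy(),
                                     alpha=0.05, method='indep', is_sorted=False)
results['p_adjusted'] = p_adjusted
results['significant'] = rejected

print(results)
```

With 1000 genes the loop is still only 1000 small tests, so it should be fast enough without any vectorization. If your real data is already numeric, the `astype(float)` step does no harm.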