Create a for loop of Wilcoxon rank sum tests in Python to generate a list of p-values?

I have a dataframe that follows this format:

df = pd.DataFrame({'subtype': ['AC', 'SCC', 'SCC', 'AC', 'AC', 'SCC', 'AC'], 
               'geneA': ['0.56', '0.74', '0.89', '0.99', '0.24', '0.76', '0.60'],
               'geneB': ['0.54', '0.73', '0.82', '0.99', '0.23', '0.74', '0.61'],
               'geneC': ['0.53', '0.72', '0.84', '0.97', '0.23', '0.76', '0.62'],
               'geneD': ['0.52', '0.77', '0.89', '0.99', '0.23', '0.75', '0.64'],
               'geneE': ['0.51', '0.77', '0.89', '0.93', '0.23', '0.76', '0.64'],
               'geneF': ['0.50', '0.79', '0.89', '0.96', '0.26', '0.73', '0.65'],
               'geneG': ['0.56', '0.78', '0.89', '0.99', '0.23', '0.76', '0.64']})

The real dataset is much larger (about 1000 genes, one per column). Each number is an mRNA abundance value.

I need to compare the AC and SCC subtypes for each gene using the Wilcoxon rank sum test, so I essentially need to run the test 1000 times. For a given gene, group1 is the mRNA values for the AC subtype and group2 is the mRNA values for the SCC subtype.

from scipy.stats import ranksums
ranksums(group1, group2)

I need a for loop that runs the rank sum test on AC vs. SCC for each gene (each column) and collects the results into a list of p-values, one per gene.

How can I achieve this in Python? This is what I have tried, with no luck:

p_vals = []

for i in range(1000):
    new_data = subset.copy()
    permuted_labels = list(subset['subtype'].sample(n=subset.shape[0], replace=False))
    new_data['subtype'] = permuted_labels
    group1 = new_data.loc[new_data.subtype == 'AC']
    group2 = new_data.loc[new_data.subtype == 'SCC']
    ranksums = ranksums(group1, group2)
    p_vals.append(ranksums)

print(p_vals)

I also need to do something similar for the fold-change (FC) of mean mRNA abundance between the AC and SCC subtypes for every gene, with the AC value in the numerator. I then need to combine the per-gene FC and the rank sum p-values into a single table, and add a column of corrected p-values computed with

from statsmodels.stats.multitest import fdrcorrection
fdrcorrection(list_of_pvalues, alpha=0.05, method='indep', is_sorted=False)

CodePudding user response：

I think I have a working solution, though I'm not sure why the p-values it returns are all exactly the same. Is that a property of the data you provided?

import pandas as pd
from scipy.stats import ranksums

df = pd.DataFrame({'subtype': ['AC', 'SCC', 'SCC', 'AC', 'AC', 'SCC', 'AC'], 
           'geneA': ['0.56', '0.74', '0.89', '0.99', '0.24', '0.76', '0.60'],
           'geneB': ['0.54', '0.73', '0.82', '0.99', '0.23', '0.74', '0.61'],
           'geneC': ['0.53', '0.72', '0.84', '0.97', '0.23', '0.76', '0.62'],
           'geneD': ['0.52', '0.77', '0.89', '0.99', '0.23', '0.75', '0.64'],
           'geneE': ['0.51', '0.77', '0.89', '0.93', '0.23', '0.76', '0.64'],
           'geneF': ['0.50', '0.79', '0.89', '0.96', '0.26', '0.73', '0.65'],
           'geneG': ['0.56', '0.78', '0.89', '0.99', '0.23', '0.76', '0.64']})

def geneRankSum(df, geneColumnName):
    # function to return the rank sum test p-value for a given gene

    ac = df[(df['subtype'] == 'AC')]
    scc = df[(df['subtype'] == 'SCC')]

    # the sample values are stored as strings, so convert them to floats
    acGene = ac[geneColumnName].astype(float)
    sccGene = scc[geneColumnName].astype(float)

    return ranksums(acGene, sccGene).pvalue


genes = list(df.columns) # list of genes from df columns
genes.remove('subtype') # removes "subtype" from list

pvalues = [] # list of pvalues to fill
for gene in genes: # loops through list of genes
    pvalues.append(geneRankSum(df, gene)) # adds pvalue of gene to list
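
For the second part of the question (fold-change plus corrected p-values in one table), something along these lines might work. This is only a sketch under a few assumptions: the abundances are on a linear scale, so a ratio of means is a sensible fold-change (for log-transformed data a difference of means would be the usual choice), the only subtypes are AC and SCC, and the column names fold_change, p_value, p_adjusted and significant are made up for illustration.

```python
import pandas as pd
from scipy.stats import ranksums
from statsmodels.stats.multitest import fdrcorrection

# Sample data from the question (values are stored as strings)
df = pd.DataFrame({'subtype': ['AC', 'SCC', 'SCC', 'AC', 'AC', 'SCC', 'AC'],
                   'geneA': ['0.56', '0.74', '0.89', '0.99', '0.24', '0.76', '0.60'],
                   'geneB': ['0.54', '0.73', '0.82', '0.99', '0.23', '0.74', '0.61'],
                   'geneC': ['0.53', '0.72', '0.84', '0.97', '0.23', '0.76', '0.62'],
                   'geneD': ['0.52', '0.77', '0.89', '0.99', '0.23', '0.75', '0.64'],
                   'geneE': ['0.51', '0.77', '0.89', '0.93', '0.23', '0.76', '0.64'],
                   'geneF': ['0.50', '0.79', '0.89', '0.96', '0.26', '0.73', '0.65'],
                   'geneG': ['0.56', '0.78', '0.89', '0.99', '0.23', '0.76', '0.64']})

genes = [col for col in df.columns if col != 'subtype']

# Convert the gene columns from strings to floats once, up front
values = df[genes].astype(float)
ac = values[df['subtype'] == 'AC']
scc = values[df['subtype'] == 'SCC']

# One row per gene: fold-change (AC mean / SCC mean) and rank sum p-value
results = pd.DataFrame({
    'gene': genes,
    'fold_change': (ac.mean() / scc.mean()).to_numpy(),
    'p_value': [ranksums(ac[g], scc[g]).pvalue for g in genes],
})

# Benjamini-Hochberg correction, with the same arguments as in the question
rejected, p_adjusted = fdrcorrection(results['p_value'].to_numpy(),
                                     alpha=0.05, method='indep', is_sorted=False)
results['p_adjusted'] = p_adjusted
results['significant'] = rejected

print(results)
```

On this sample the p-values do all come out identical, and that is a property of the data: the rank sum test only looks at the ranks of the pooled values, and in every one of these seven genes the AC samples hold the three lowest values and the single highest one, so the test statistic is the same each time. With the real 1000-gene data the ranks will differ from gene to gene and so will the p-values.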