T Test on Multiple Columns in Dataframe

Dataframe looks something like:

decade     rain     snow
1910       0.2      0.2
1910       0.3      0.4
2000       0.4      0.5
2010       0.1      0.1

I'd love some help with a Python function that runs a t test comparing decade combinations for a given column. The function below works great, except that it does not take an input column such as rain or snow.

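import pandas as pd
from itertools import combinations
from scipy import stats as st

def ttest_run(c1, c2):
    # Assumption: the original snippet did not show how cat1 and cat2 were
    # built; here they are the hard-coded 'rain' values for each decade.
    cat1 = data.loc[data['decade'] == c1, 'rain']
    cat2 = data.loc[data['decade'] == c2, 'rain']
    results = st.ttest_ind(cat1, cat2, nan_policy='omit')
    df = pd.DataFrame({'dec1': c1,
                       'dec2': c2,
                       'tstat': results.statistic,
                       'pvalue': results.pvalue},
                       index=[0])
    return df

# data is the dataframe shown above
df_list = [ttest_run(i, j) for i, j in combinations(data['decade'].unique().tolist(), 2)]

final_df = pd.concat(df_list, ignore_index=True)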

CodePudding user response:

I think you want something like this:

import pandas as pd
from itertools import combinations
from scipy import stats as st


d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'], 
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4], 
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)


def all_pairwise(df, compare_col = 'decade'):
    decade_pairs = [(i,j) for i, j in combinations(df[compare_col].unique().tolist(), 2)]
    # or add a list of colnames to function signature
    cols = list(df.columns)
    cols.remove(compare_col)
    list_of_dfs = []
    for pair in decade_pairs:
        for col in cols:
            c1 = df[df[compare_col] == pair[0]][col]
            c2 = df[df[compare_col] == pair[1]][col]
            results = st.ttest_ind(c1, c2, nan_policy='omit')
            tmp = pd.DataFrame({'dec1': pair[0],
                                'dec2': pair[1],
                                'tstat': results.statistic,
                                'pvalue': results.pvalue}, index = [col])
            list_of_dfs.append(tmp)
    df_stats = pd.concat(list_of_dfs)
    return df_stats

df_stats = all_pairwise(df)
df_stats

If you execute that code, you'll get runtime warnings. Decades with only a single record have too few data points to estimate a variance, so the t-statistic calculation divides by zero, which causes the NaNs in the output:

>>> df_stats
      dec1  dec2     tstat    pvalue
rain  1910  2000       NaN       NaN
snow  1910  2000       NaN       NaN
rain  1910  2010       NaN       NaN
snow  1910  2010       NaN       NaN
rain  1910  1990  0.000000  1.000000
snow  1910  1990  0.436436  0.685044
rain  2000  2010       NaN       NaN
...
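To see where the NaNs come from, it can help to count the records per decade and to compute the sample variance of a single value directly. This is only a small sketch on the example data above; the names group_sizes and single_var are mine, and the warning filter is there just to keep the demonstration quiet.

```python
import warnings

import numpy as np
import pandas as pd

d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'],
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4],
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)

# Number of rows per decade; '2000' and '2010' each have a single row.
group_sizes = df.groupby('decade').size()
print(group_sizes)

# The sample variance (ddof=1) of one observation divides by n - 1 = 0,
# so NumPy warns and returns NaN. The t statistic inherits that NaN.
with warnings.catch_warnings():
    warnings.simplefilter('ignore', RuntimeWarning)
    single_var = np.var([0.3], ddof=1)
print(single_var)
```

Any pair of decades that includes one of the single-row groups ends up with a NaN t statistic and p-value for this reason.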

If you don't want all columns, but only a specified set, change the function definition line to read:

def all_pairwise(df, cols, compare_col = 'decade'):

where cols should be an iterable of string column names (a list will work fine). You'll need to remove the two lines:

    cols = list(df.columns)
    cols.remove(compare_col)

from the function body; the rest of the function works unchanged.
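Put together, the version that takes a list of columns might look roughly like the sketch below. It is the same logic as the function above with the two lines removed; the loop variable names g1 and g2 and the use of .loc for selection are my own choices, not a requirement.

```python
from itertools import combinations

import pandas as pd
from scipy import stats as st

d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'],
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4],
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)


def all_pairwise(df, cols, compare_col='decade'):
    """Run a two-sample t test for every pair of groups and every column in cols."""
    group_pairs = combinations(df[compare_col].unique().tolist(), 2)
    list_of_dfs = []
    for g1, g2 in group_pairs:
        for col in cols:
            # Values of this column for each of the two groups being compared.
            c1 = df.loc[df[compare_col] == g1, col]
            c2 = df.loc[df[compare_col] == g2, col]
            results = st.ttest_ind(c1, c2, nan_policy='omit')
            tmp = pd.DataFrame({'dec1': g1,
                                'dec2': g2,
                                'tstat': results.statistic,
                                'pvalue': results.pvalue}, index=[col])
            list_of_dfs.append(tmp)
    return pd.concat(list_of_dfs)


# Only compare the 'rain' column; single-row decades still trigger warnings.
result = all_pairwise(df, cols=['rain'])
print(result)
```

Passing cols=['rain', 'snow'] gives the same comparisons as the original version that used every column except the grouping column.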

You'll always get the runtime warnings unless you filter out decades with too few records before passing the dataframe to the function.
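One way to do that filtering, as a rough sketch: drop every decade that has fewer than some minimum number of rows. The threshold name min_records and its value of 2 are assumptions of mine. Note that this counts rows, not non-missing values, so a column containing NaNs could still leave a group too small once nan_policy='omit' drops them.

```python
import pandas as pd

d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'],
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4],
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)

# Hypothetical threshold: a sample variance needs at least two observations.
min_records = 2

# Keep only the rows that belong to decades with enough records.
df_filtered = df.groupby('decade').filter(lambda g: len(g) >= min_records)
print(df_filtered)
```

The filtered dataframe can then be passed to the function in place of df, so only decades with enough records are compared.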

Here is an example call to the version that accepts a list of columns as an argument; it shows the runtime warnings.

>>> all_pairwise(df, cols=['rain'])
/usr/local/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3723: RuntimeWarning: Degrees of freedom <= 0 for slice
  return _methods._var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.8/site-packages/numpy/core/_methods.py:254: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
      dec1  dec2  tstat  pvalue
rain  1910  2000    NaN     NaN
rain  1910  2010    NaN     NaN
rain  1910  1990    0.0     1.0
rain  2000  2010    NaN     NaN
rain  2000  1990    NaN     NaN
rain  2010  1990    NaN     NaN
>>>