My DataFrame looks something like:
decade rain snow
1910 0.2 0.2
1910 0.3 0.4
2000 0.4 0.5
2010 0.1 0.1
I'd love some help with a function in Python that runs a t-test comparing every combination of decades for a given column. The function below works, except that it does not take an input column such as rain or snow.
import pandas as pd
from itertools import combinations
from scipy import stats as st

def ttest_run(c1, c2):
    # note: cat1 and cat2 are not defined in this snippet
    results = st.ttest_ind(cat1, cat2, nan_policy='omit')
    df = pd.DataFrame({'dec1': c1,
                       'dec2': c2,
                       'tstat': results.statistic,
                       'pvalue': results.pvalue},
                      index=[0])
    return df

df_list = [ttest_run(i, j) for i, j in combinations(data['decade'].unique().tolist(), 2)]
final_df = pd.concat(df_list, ignore_index=True)
CodePudding user response:
I think you want something like this:
import pandas as pd
from itertools import combinations
from scipy import stats as st

d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'],
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4],
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)

def all_pairwise(df, compare_col='decade'):
    decade_pairs = [(i, j) for i, j in combinations(df[compare_col].unique().tolist(), 2)]
    # or add a list of colnames to function signature
    cols = list(df.columns)
    cols.remove(compare_col)
    list_of_dfs = []
    for pair in decade_pairs:
        for col in cols:
            c1 = df[df[compare_col] == pair[0]][col]
            c2 = df[df[compare_col] == pair[1]][col]
            results = st.ttest_ind(c1, c2, nan_policy='omit')
            tmp = pd.DataFrame({'dec1': pair[0],
                                'dec2': pair[1],
                                'tstat': results.statistic,
                                'pvalue': results.pvalue}, index=[col])
            list_of_dfs.append(tmp)
    df_stats = pd.concat(list_of_dfs)
    return df_stats

df_stats = all_pairwise(df)
df_stats
Now if you execute that code, you'll get runtime warnings. They come from divisions by zero when a decade has too few data points to calculate a t-statistic, and those cases show up as NaNs in the output:
>>> df_stats
dec1 dec2 tstat pvalue
rain 1910 2000 NaN NaN
snow 1910 2000 NaN NaN
rain 1910 2010 NaN NaN
snow 1910 2010 NaN NaN
rain 1910 1990 0.000000 1.000000
snow 1910 1990 0.436436 0.685044
rain 2000 2010 NaN NaN
...
If you don't want all columns but only a specified set, change the function signature to read:
def all_pairwise(df, cols, compare_col = 'decade'):
where cols should be an iterable of column-name strings (a list will work fine). You'll also need to remove these two lines from the function body:
    cols = list(df.columns)
    cols.remove(compare_col)
The rest of the function works unchanged.
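For reference, here is a sketch of what the function might look like with that change applied. It assumes the same sample data as above. Collecting plain dicts and building the DataFrame once at the end is my own variation on the original loop, not something the change requires.

```python
import pandas as pd
from itertools import combinations
from scipy import stats as st

# Same sample data as in the answer above.
d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'],
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4],
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)

def all_pairwise(df, cols, compare_col='decade'):
    """Run a two-sample t-test for every pair of groups and every column in cols."""
    rows = []    # one dict of results per (pair, column)
    labels = []  # column name used as the row label, as in the original output
    groups = df[compare_col].unique().tolist()
    for g1, g2 in combinations(groups, 2):
        for col in cols:
            c1 = df.loc[df[compare_col] == g1, col]
            c2 = df.loc[df[compare_col] == g2, col]
            results = st.ttest_ind(c1, c2, nan_policy='omit')
            rows.append({'dec1': g1,
                         'dec2': g2,
                         'tstat': results.statistic,
                         'pvalue': results.pvalue})
            labels.append(col)
    return pd.DataFrame(rows, index=labels)

# Groups with a single record still produce NaN and runtime warnings.
print(all_pairwise(df, cols=['rain']))
```

Passing cols=['rain', 'snow'] gives the same rows as the version that infers the columns from the DataFrame.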
You'll always get the runtime warnings unless you filter out decades with too few records (fewer than two) before passing the data to the function.
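One possible way to do that filtering is sketched below. The helper name drop_small_groups and its min_records parameter are hypothetical, made up for this example. It counts rows per decade, so a column that contains NaNs could still leave a decade with too few usable values.

```python
import pandas as pd

# Same sample data as in the answer above.
d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'],
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4],
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)

def drop_small_groups(df, compare_col='decade', min_records=2):
    """Keep only rows whose group has at least min_records rows (hypothetical helper)."""
    # A sample variance needs at least two observations, hence the default of 2.
    return df.groupby(compare_col).filter(lambda g: len(g) >= min_records)

filtered = drop_small_groups(df)
# Only decades with two or more rows remain.
print(filtered['decade'].unique())
```

The filtered frame can then be passed to all_pairwise in place of df, so only decades with enough records are compared.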
Here is an example call to the version that accepts a list of columns as an argument, showing the runtime warnings:
>>> all_pairwise(df, cols=['rain'])
/usr/local/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3723: RuntimeWarning: Degrees of freedom <= 0 for slice
return _methods._var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.8/site-packages/numpy/core/_methods.py:254: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
dec1 dec2 tstat pvalue
rain 1910 2000 NaN NaN
rain 1910 2010 NaN NaN
rain 1910 1990 0.0 1.0
rain 2000 2010 NaN NaN
rain 2000 1990 NaN NaN
rain 2010 1990 NaN NaN
>>>