How to implement resampled paired-t test for several ML classifiers and databases in Python


I am trying to use this paired t-test code for more than 2 ML classifiers and databases:

Whole code and the databases: https://github.com/cemdogdu/stack

import numpy as np
from scipy.stats import t as t_dist

def paired_t_test(p):
    p_hat = np.mean(p)
    n = len(p)
    den = np.sqrt(sum([(diff - p_hat)**2 for diff in p]) / (n - 1))
    t = (p_hat * (n**(1/2))) / den

    p_value = t_dist.sf(t, n-1)*2

    return t, p_value

n_tests = 30

# rf, knn, X, and y are defined in the linked repository
p_ = []
rng = np.random.RandomState(42)
for i in range(n_tests):
    randint = rng.randint(low=0, high=32767)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=randint)
    rf.fit(X_train, y_train)
    knn.fit(X_train, y_train)

    acc1 = accuracy_score(y_test, rf.predict(X_test))
    acc2 = accuracy_score(y_test, knn.predict(X_test))
    p_.append(acc1 - acc2)
    
print("Paired t-test Resampled")
t, p = paired_t_test(p_)
print(f"t statistic: {t}, p-value: {p}\n")

However, when I create a for loop over several classifiers,

p_ = np.zeros(n_tests)
p = np.zeros((len(clf_list), len(clf_list)))
for ii in range(len(clf_list)):
    for jj in range(len(clf_list)):

        for kk in tqdm(range(n_tests)):
            # clf_list = deepcopy(clf_list_temp)
            clf1 = clf_list[ii]
            clf2 = clf_list[jj]

it produces different accuracies on each run of the loop that reads the datasets with 'for file in glob.glob(path)'.

Also, I sometimes get p-values bigger than 1, which does not happen when I compare each pair a single time. What could be the problem here?

CodePudding user response:

Regarding why you get different results, if you look at this part of your code:

randint = rng.randint(low=0, high=32767)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=randint)

I suspect that when you iterate over your other classifiers and datasets, the random states drawn from rng differ between passes, hence the different outcomes. Without a reproducible example, we cannot replicate the discrepancy you are seeing.
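One way to make the pairwise comparisons reproducible is to draw all the split seeds once, before the classifier loops, so every pair of classifiers is evaluated on exactly the same splits. A minimal sketch, assuming X and y from your code (accuracy_differences is a hypothetical helper, not part of your repository):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

n_tests = 30
rng = np.random.RandomState(42)
# Draw every split seed up front so all classifier pairs
# see identical train/test splits.
seeds = rng.randint(low=0, high=32767, size=n_tests)

def accuracy_differences(clf1, clf2, X, y, seeds):
    """Per-split accuracy differences between two classifiers,
    each refit from scratch on every resampled split."""
    diffs = []
    for seed in seeds:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.20, random_state=seed)
        clf1.fit(X_train, y_train)
        clf2.fit(X_train, y_train)
        diffs.append(accuracy_score(y_test, clf1.predict(X_test))
                     - accuracy_score(y_test, clf2.predict(X_test)))
    return diffs

Note that classifiers with internal randomness, such as a random forest, also need a fixed random_state of their own if you want identical accuracies across runs.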

As for the issue with the p-values, you need to ensure that your test is two-sided. With your current code, if the mean of p_ (and hence your t statistic) is negative, t_dist.sf(t, n-1)*2 exceeds 1, so you end up with a p-value > 1:

import numpy as np
from scipy.stats import t as t_dist

np.random.seed(111)
acc1 = np.random.uniform(0, 1, 10)
acc2 = np.random.uniform(0, 1, 10)

acc1.mean()
# 0.3450090833343872

acc2.mean()
# 0.44340491701581025

paired_t_test(acc1 - acc2)
# (-0.9621893188877937, 1.6389080997936225)

paired_t_test(acc2 - acc1)
# (0.9621893188877937, 0.3610919002063774)

If you change your code as follows, you ensure that you are testing a two-sided t statistic:

def paired_t_test(p):
    p_hat = np.mean(p)
    n = len(p)
    den = np.sqrt(sum([(diff - p_hat)**2 for diff in p]) / (n - 1))
    t = (p_hat * (n**(1/2))) / den

    # use |t| so the doubled tail probability can never exceed 1
    p_value = t_dist.sf(abs(t), n-1)*2

    return t, p_value

Regardless of the sign of the differences, we now get the same p-value:

paired_t_test(acc2 - acc1)
# (0.9621893188877937, 0.3610919002063774)

paired_t_test(acc1 - acc2)
# (-0.9621893188877937, 0.3610919002063774)
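Putting both fixes together, here is a minimal sketch of the pairwise loop that fills a symmetric matrix of p-values; it assumes the corrected paired_t_test above, the hypothetical accuracy_differences helper and seeds array sketched earlier, and clf_list, X, y from your code:

n_clf = len(clf_list)
p_matrix = np.ones((n_clf, n_clf))  # diagonal stays 1: a classifier vs. itself

for ii in range(n_clf):
    for jj in range(ii + 1, n_clf):  # visit each unordered pair once
        diffs = accuracy_differences(clf_list[ii], clf_list[jj], X, y, seeds)
        _, p_value = paired_t_test(diffs)
        p_matrix[ii, jj] = p_matrix[jj, ii] = p_value

Because every pair is evaluated on the same splits and the test uses |t|, the matrix is symmetric and no entry can exceed 1.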