So, I was working on implementing my own version of the statistical test of homogeneity in Python, where the user submits a list of lists and the function computes the corresponding chi value.
One issue I found was that my function was dropping decimals when performing division, resulting in a somewhat inaccurate chi value for small sample sizes.
Here is the code:
import numpy as np
import scipy.stats as stats

def test_of_homo(list1):
    a = np.array(list1)
    #n = a.size
    num_rows = a.shape[0]
    num_cols = a.shape[1]
    dof = (num_cols-1)*(num_rows-1)
    column_totals = np.sum(a, axis=0)
    row_totals = np.sum(a, axis=1)
    n = sum(row_totals)
    b = np.array(list1)
    c = 0
    for x in range(num_rows):
        for y in range(num_cols):
            print("X is " + str(x))
            print("Y is " + str(y))
            print("a[x][y] is " + str(a[x][y]))
            print("row_totals[x] is " + str(row_totals[x]))
            print("column_total[y] is " + str(column_totals[y]))
            b[x][y] = (float(row_totals[x])*float(column_totals[y]))/float(n)
            print("b[x][y] is " + str(b[x][y]))
            numerator = ((a[x][y]) - b[x][y])**2
            chi = float(numerator)/float(b[x][y])
            c = float(c) + float(chi)
    print(b)
    print(c)
    print(stats.chi2.cdf(c, df=dof))
    print(1-(stats.chi2.cdf(c, df=dof)))

listc = [(21, 36, 30), (48, 26, 19)]
test_of_homo(listc)
When the results were printed, I saw that the b[x][y] values were [[33 29 23] [35 32 25]] instead of something like 33.35, 29.97, 23.68, etc. This caused my resulting chi value to be 15.58 with a p of 0.0004 instead of the expected 14.5.
I tried to convert everything to float, but that didn't seem to work. Using decimal.Decimal(b[x][y]) resulted in a type error. Any help?
CodePudding user response:
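I think the problem is caused by the data type of the arrays you build from the list. Both a and b are created from a list of integers, so they end up as integer arrays, and any float you assign to b[x][y] is truncated to fit. Note that if you convert a list to a NumPy array without specifying the data type, NumPy infers it from the values: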
>>> listc = [(21, 36, 30), (48, 26, 19)]
>>> a = np.array(listc)
>>> a.dtype
dtype('int64')
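To see the effect in isolation, here is a minimal sketch using your first cell (the variable names are just for illustration). It stores the float expected count into an integer array and shows that the array element keeps only the whole part, which is why calling float() on the right-hand side did not help:

```python
import numpy as np

# Integer array, because the input list contains only integers.
b = np.array([(21, 36, 30), (48, 26, 19)])

# Expected count for cell (0, 0): row total * column total / n.
expected = 87.0 * 69.0 / 180.0

# The array keeps its integer dtype, so the fractional part is discarded
# when the float is stored.
b[0][0] = expected

print(expected)   # the full float value
print(b[0][0])    # the truncated value stored in the integer array
print(b.dtype)    # still an integer dtype
```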
Here is how you force conversion to a desired data type:
>>> a = np.array(listc, dtype=float)
>>> a.dtype
dtype('float64')
Try that in the first and ninth lines of your function (the two np.array(list1) calls that create a and b) and see if it solves the problem. If you do this, you shouldn't need to call float() all the time.
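If it helps, here is a rough sketch of how the same calculation could look once the array is float, with the loops replaced by array operations. This is only one possible way to write it, not a drop-in for your function: the name chi_square_homogeneity and the returned tuple are my own choices, and I use stats.chi2.sf for the p-value instead of 1 - cdf.

```python
import numpy as np
import scipy.stats as stats

def chi_square_homogeneity(table):
    """Return (chi value, degrees of freedom, p-value, expected counts)."""
    # dtype=float keeps the expected counts from being truncated.
    observed = np.array(table, dtype=float)

    row_totals = observed.sum(axis=1)
    column_totals = observed.sum(axis=0)
    n = observed.sum()

    # Expected count per cell: row total * column total / n.
    expected = np.outer(row_totals, column_totals) / n

    # Sum of (observed - expected)^2 / expected over all cells.
    chi_value = ((observed - expected) ** 2 / expected).sum()

    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

    # Survival function = 1 - cdf, i.e. the upper-tail p-value.
    p_value = stats.chi2.sf(chi_value, df=dof)
    return chi_value, dof, p_value, expected

listc = [(21, 36, 30), (48, 26, 19)]
chi_value, dof, p_value, expected = chi_square_homogeneity(listc)
print(expected)
print(chi_value, dof, p_value)
```

With your listc this should give expected counts starting at 33.35 and a chi value of roughly 14.46, in line with the 14.5 you were expecting.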