I have two dictionaries that contain two discrete distributions: A={1: 300, 2: 400, 4: 20,...} and B={2: 100, 3: 200 , 4: 75,...}. I want to check how similar they are, and I thought of performing a two-sample Kolmogorov-Smirnov test.
I checked the scipy function, but it seems to work only on numpy arrays. How can I perform it on my data?
CodePudding user response:
Perhaps convert your dictionary values to numpy arrays:
import numpy as np
from scipy import stats  # import the submodule explicitly; a bare "import scipy" may not expose scipy.stats
a = np.array(list(A.values()))
b = np.array(list(B.values()))
stats.ks_2samp(a, b)
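If the dictionary values are counts of how often each key was observed (an assumption, since the question does not say so explicitly), it may be more appropriate to give ks_2samp the raw observations rather than the counts themselves. Below is a minimal sketch of that idea. The example dictionaries are made up because the originals are truncated, and expand_counts is a hypothetical helper name. Note that the reported p-value is derived for continuous distributions, so treat it as approximate for discrete data.

```python
import numpy as np
from scipy import stats

# Made-up example data; the dictionaries in the question are truncated.
A = {1: 300, 2: 400, 4: 20}
B = {2: 100, 3: 200, 4: 75}

def expand_counts(counts):
    """Hypothetical helper: turn {value: count} into a flat array of observations."""
    values = np.array(list(counts.keys()))
    freqs = np.array(list(counts.values()))
    # np.repeat repeats each value as many times as its count
    return np.repeat(values, freqs)

sample_a = expand_counts(A)  # 300 ones, 400 twos, 20 fours
sample_b = expand_counts(B)  # 100 twos, 200 threes, 75 fours

result = stats.ks_2samp(sample_a, sample_b)
print(result.statistic, result.pvalue)
```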
CodePudding user response:
You can create a pandas DataFrame to handle missing values easily, and then run the test on the resulting columns. You cannot compare the two series on keys they do not share, so dropna() discards the rows where either value is missing.
import pandas as pd
from scipy import stats
data_frame = pd.DataFrame(dict(s1=A, s2=B)).dropna()
result = stats.ks_2samp(data_frame.iloc[:, 0], data_frame.iloc[:, 1])
CodePudding user response:
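You can easily transform your data into numpy arrays:
import numpy as np
# union of the keys of both dictionaries, in sorted order
my_keys = sorted(set(A) | set(B))
# build lists first: passing a bare generator to np.array gives a 0-d object array
A_array = np.array([A.get(key, 0) for key in my_keys])
B_array = np.array([B.get(key, 0) for key in my_keys])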
I noticed that A and B do not have the same keys (for example, B does not seem to contain key 1), so you need to pay attention to that. This is why I took the union of the keys and used a value of 0 when a key is missing from a dictionary (I assume that, in that case, you have no observations for that key).
Now the two arrays are compatible for the test.
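Once the counts are aligned on a common set of keys, the two-sample KS statistic (the largest gap between the two empirical CDFs) can also be computed directly from them. The sketch below assumes the values are observation counts and uses made-up example dictionaries; it computes only the statistic, not a p-value.

```python
import numpy as np

# Made-up example data; the dictionaries in the question are truncated.
A = {1: 300, 2: 400, 4: 20}
B = {2: 100, 3: 200, 4: 75}

# Align both dictionaries on the sorted union of their keys, filling gaps with 0.
my_keys = sorted(set(A) | set(B))
A_array = np.array([A.get(key, 0) for key in my_keys], dtype=float)
B_array = np.array([B.get(key, 0) for key in my_keys], dtype=float)

# Empirical CDFs: cumulative counts divided by the total number of observations.
cdf_a = np.cumsum(A_array) / A_array.sum()
cdf_b = np.cumsum(B_array) / B_array.sum()

# KS statistic: maximum absolute difference between the two CDFs.
ks_statistic = np.max(np.abs(cdf_a - cdf_b))
print(ks_statistic)
```

Sorting the keys matters here, because the cumulative sums only form a valid CDF when the values are in increasing order.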