Z-score computation of a pandas DataFrame returns differing classes

Time:07-20

I am trying to calculate the z-score of a pandas DataFrame using scipy's zscore function. The calculation succeeds, but the type of the returned object depends on which host the program runs on.

I suspect this is related to the different versions of the packages involved, but I haven't found the actual reason for the difference.

  • Why does the returned type differ between the two hosts?

          Host 1    Host 2
python    3.6.8     3.7.3
pandas    1.1.5     1.3.1
numpy     1.19.5    1.19.2
scipy     1.5.4     1.7.3

Example:

Host 1

import numpy as np
import pandas as pd
from scipy.stats import zscore
df = pd.DataFrame(np.random.randint(100, 200, size=(5, 3)), columns=['A', 'B', 'C'])

# --------------------------------

In [5]: df
Out[5]: 
     A    B    C
0  166  135  141
1  156  110  167
2  104  159  114
3  150  156  157
4  163  113  180

In [10]: zscore(df)
Out[10]: 
array([[ 0.80546745,  0.01940194, -0.47372066],
       [ 0.36290292, -1.19321913,  0.66671797],
       [-1.93843265,  1.18351816, -1.65802232],
       [ 0.0973642 ,  1.03800363,  0.22808773],
       [ 0.67269809, -1.0477046 ,  1.23693729]])

In [11]: zscore(df, ddof=0)
Out[11]: 
array([[ 0.80546745,  0.01940194, -0.47372066],
       [ 0.36290292, -1.19321913,  0.66671797],
       [-1.93843265,  1.18351816, -1.65802232],
       [ 0.0973642 ,  1.03800363,  0.22808773],
       [ 0.67269809, -1.0477046 ,  1.23693729]])

In [12]: type(zscore(df))
Out[12]: numpy.ndarray



Host 2

import numpy as np
import pandas as pd
from scipy.stats import zscore
df = pd.DataFrame(np.random.randint(100, 200, size=(5, 3)), columns=['A', 'B', 'C'])

# --------------------------------


In [77]: df
Out[77]: 
     A    B    C
0  151  188  190
1  195  199  103
2  130  174  188
3  168  194  146
4  171  138  129

In [78]: zscore(df)
Out[78]: 
          A         B         C
0 -0.553990  0.428052  1.148875
1  1.477308  0.928963 -1.427210
2 -1.523474 -0.209472  1.089654
3  0.230829  0.701276 -0.153973
4  0.369327 -1.848819 -0.657346

In [79]: zscore(df, ddof=0)
Out[79]: 
          A         B         C
0 -0.553990  0.428052  1.148875
1  1.477308  0.928963 -1.427210
2 -1.523474 -0.209472  1.089654
3  0.230829  0.701276 -0.153973
4  0.369327 -1.848819 -0.657346

In [80]: type(zscore(df))
Out[80]: pandas.core.frame.DataFrame



Answer:

Looking at the source code of scipy's zscore in v1.5.4 (the version on Host 1), the input is converted to a NumPy array with np.asanyarray(a), and that array is then processed and returned. In v1.7.3 (the version on Host 2), zscore instead delegates to the zmap function, which computes the z-score of the passed array or DataFrame while preserving its type (see this line).
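To make the difference concrete, here is a simplified sketch of the two styles. It is not scipy's actual source: the helper names zscore_converting and zscore_preserving are made up for illustration, and the second one assumes the type is preserved because the final arithmetic is applied to the original object instead of the converted array. The data is the Host 1 frame from the question.

```python
import numpy as np
import pandas as pd


def zscore_converting(a, ddof=0):
    # Hypothetical helper: the input is converted up front, so all later
    # arithmetic runs on a plain ndarray and an ndarray is returned.
    a = np.asanyarray(a)
    mn = a.mean(axis=0, keepdims=True)
    std = a.std(axis=0, ddof=ddof, keepdims=True)
    return (a - mn) / std


def zscore_preserving(scores, ddof=0):
    # Hypothetical helper: the statistics come from an array view of the data,
    # but the subtraction and division are applied to the original object,
    # so a DataFrame input yields a DataFrame output.
    a = np.asanyarray(scores)
    mn = a.mean(axis=0, keepdims=True)
    std = a.std(axis=0, ddof=ddof, keepdims=True)
    return (scores - mn) / std


df = pd.DataFrame(
    {
        "A": [166, 156, 104, 150, 163],
        "B": [135, 110, 159, 156, 113],
        "C": [141, 167, 114, 157, 180],
    }
)

old_style = zscore_converting(df)
new_style = zscore_preserving(df)

print(type(old_style).__name__)
print(type(new_style).__name__)
```

Both helpers produce the same numbers; only the container type of the result differs.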

In conclusion, the culprit for this behavior is the newer scipy version on Host 2. Hope this helps!
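If the code has to behave the same on both hosts, one option is to pin the return type yourself instead of relying on what zscore returns. The following is a minimal sketch, assuming a purely numeric DataFrame; the wrapper names zscore_frame and zscore_array are made up for this example.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore


def zscore_frame(df, ddof=0):
    # Hypothetical wrapper: pass a plain ndarray to scipy, then rebuild the
    # DataFrame so the labels survive regardless of the scipy version.
    values = zscore(df.to_numpy(), axis=0, ddof=ddof)
    return pd.DataFrame(values, index=df.index, columns=df.columns)


def zscore_array(df, ddof=0):
    # Hypothetical wrapper: always return a plain ndarray.
    return np.asarray(zscore(df.to_numpy(), axis=0, ddof=ddof))


df = pd.DataFrame(
    {
        "A": [166, 156, 104, 150, 163],
        "B": [135, 110, 159, 156, 113],
        "C": [141, 167, 114, 157, 180],
    }
)

as_frame = zscore_frame(df)
as_array = zscore_array(df)

print(type(as_frame).__name__)
print(type(as_array).__name__)
```

Passing df.to_numpy() means scipy only ever sees an ndarray, so the result no longer depends on how a given scipy release treats DataFrame inputs.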
