Scipy and numpy standard deviation methods give slightly different results. I don't understand why. Can anyone explain that to me?
Here is an example.
import numpy as np
import scipy.stats
ar = np.arange(20)
print(np.std(ar))
print(scipy.stats.tstd(ar))
returns
5.766281297335398
5.916079783099616
CodePudding user response:
This came up for me a while ago. To get the same values, pass ddof=1 to np.std():
import numpy as np
import scipy.stats
ar = np.arange(20)
print(np.std(ar, ddof=1))
print(scipy.stats.tstd(ar))
output:
5.916079783099616
5.916079783099616
My mentor used to say:
use ddof=1 if you're calling np.std() on a sample taken from your complete dataset;
use ddof=0 if you're calculating it for the full population.
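For what it's worth, here is a minimal sketch (my own illustration, reusing the same ar = np.arange(20) as above) of what the two ddof settings correspond to:
import numpy as np
ar = np.arange(20)
dev2 = (ar - ar.mean()) ** 2                      # squared deviations from the mean
pop_std = np.sqrt(dev2.sum() / len(ar))           # divide by n     (population)
sample_std = np.sqrt(dev2.sum() / (len(ar) - 1))  # divide by n - 1 (sample)
print(pop_std, np.std(ar, ddof=0))     # 5.766281297335398 twice
print(sample_std, np.std(ar, ddof=1))  # 5.916079783099616 twice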
CodePudding user response:
With np.std() you are computing the population standard deviation:
x = np.abs(ar - ar.mean())**2
std = np.sqrt(np.sum(x) / len(ar)) # 5.766281297335398
However, with scipy.stats.tstd you are computing the trimmed sample standard deviation; with no limits given, that is simply the standard deviation with divisor n - 1:
x = np.abs(ar - ar.mean())**2
std = np.sqrt(np.sum(x) / (len(ar) - 1)) # 5.916079783099616
Note that when using np.std() you are computing the square root of the mean of x (the mean of x is the sum of x divided by the length of x). When computing the trimmed version you are instead dividing by n - 1, n being the length of the array.
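As a quick cross-check (my own sketch, relying on scipy.stats.tstd also accepting ddof and limits keyword arguments), you can make the two functions agree in either direction and see where the "trimmed" part actually comes in:
import numpy as np
import scipy.stats
ar = np.arange(20)
# divisor n - 1 on both sides
print(np.std(ar, ddof=1), scipy.stats.tstd(ar))   # 5.916079783099616 twice
# divisor n on both sides (tstd takes ddof as well)
print(np.std(ar), scipy.stats.tstd(ar, ddof=0))   # 5.766281297335398 twice
# the "trimmed" part only matters when limits are given:
# values outside [2, 17] are dropped before the std is computed
print(scipy.stats.tstd(ar, limits=(2, 17)))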