I am trying to generate data which follow specified truncated normal distribution. Based on answers
showcases the truncated normal density alluding to the fact that the narrower the interval [lower, upper]
are chosen, the smaller the standard deviation will be (even approaching 0
asymptotically when lower
and upper
get infinitesimally close).
Let's make this rigorous to really be sure. Given the age-old equations for the expected value and variance of our (truncated normal random variable X
) we have
Then, defining the helper functions
def xfx(x, lower=lower, upper=upper, mu=mu, sigma=sigma):
'''helper function returning x*f(x) for the truncated normal density f'''
return x*scipy.stats.truncnorm.pdf(x, (lower-mu)/sigma, (upper-mu)/sigma, loc=mu, scale=sigma)
def x_EX_fx(x, lower=lower, upper=upper, mu=mu, sigma=sigma):
'''helper function returning (x - E[X])**2 * f(x) for the truncated normal density f'''
EX = quad(func=xfx,a=lower,b=upper)[0]
return ((x - EX)**2) * scipy.stats.truncnorm.pdf(x, (lower-mu)/sigma, (upper-mu)/sigma, loc=mu, scale=sigma)
allows us to compute the values exact
# E[X], expected value of X
quad(func=xfx,a=lower,b=upper)[0]
> 10.0
# (Var(X))^(1/2), standard deviation of X
np.sqrt(quad(func=x_EX_fx,a=lower,b=upper)[0])
> 2.697
This looks eerily similar to your observed value 2.673
. Let's see if the difference is merely based on the finite sample size by running a simulation study to observe if the empirical standard deviation approaches the theoretical one.
# simulation study
np.random.seed(7447)
stdList = [scipy.stats.truncnorm.rvs((lower-mu)/sigma, (upper-mu)/sigma, loc=mu, scale=sigma, size=round(10**N)).std() for N in range(2,8)]
# plot
plt.title("Convergence behaviour of $\hat{σ}_{n}$ to σ", fontsize=18)
plt.plot(range(2,8), stdList)
plt.axhline(2.697800468774485, color='red', lw=0.85)
plt.legend({'emprical' : 'blue', 'theoretical' : 'red'}, fontsize=14)
plt.xlabel("$log_{10}(N)$", fontsize=14)
plt.show()
yielding
This confirms that your output is sound,
CodePudding user response:
This is generating a clipped normal distribution between [5,15]. This is /- 1 s.d, so the s.d. measured across this sample will not be equal to the input.
If you clip the range of outputs, you necessarily reduce the s.d. observed.
As lower/upper -> /-infinity, the sample std -> 5. As lower/upper -> 10, the sample std -> 0.