Consider the following dataset with random data:
test_dataset = np.array([ -2.09601881, -4.26602684, 1.09105452, -4.59559669,
1.05865251, -0.93076762, -14.70398945, -18.01937129,
4.64126152, -10.34178822, -9.46058493, -5.66864965,
-3.17562022, 15.7030379 , 10.59675205, -5.80882413,
-24.00604149, -4.81518663, -1.94333927, 1.18142171,
12.72030312, 3.84917581, -0.4468796 , 11.91828567,
-17.99171774, 9.35108712, -5.57233376, 5.77547128,
5.49296099, -10.96132844, -18.75174336, 5.27843303,
25.73548956, -21.58043021, -14.24734733, 12.57886018,
-22.10002076, 1.72207555, -6.0411867 , -3.63568527,
7.26542117, -0.21449529, -6.64974714, -0.94574606,
-4.23339431, 16.76199734, -12.42195793, 18.965854 ,
-23.85336123, -15.55104466, 6.17215868, 7.34993316,
8.62461351, -16.30482638, -16.35601099, 1.96857833,
18.74440399, -22.48374434, -10.895831 , -10.14393648,
-17.62768751, 4.83388855, 20.1578181 , 6.04299626,
0.97198296, -3.40889754, -10.62734293, 1.70240472,
20.4203839 , 10.26751364, 15.47859675, -10.97940064,
1.82728251, 4.22894717, 8.31502887, -5.48502811,
-1.09244874, -11.32072796, -24.88520436, -7.42108403,
19.4200716 , 4.82704045, -12.46290135, -15.18466755,
6.37714692, -11.06825059, 5.10898588, -9.07485484,
1.63946084, -12.2270078 , 12.63776832, -25.03916909,
2.42972082, -14.22890171, 18.2199446 , 6.9819771 ,
-12.07795089, 2.59948596, -16.90206575, 6.35192719,
7.33823106, -23.69653447, -11.66091871, -19.40251179,
-12.64863792, 11.04004231, 13.7247356 , -16.36107329,
20.43227515, 17.97334692, 16.92675175, -5.62051239,
-8.66304184, -8.40848514, -23.20919855, 0.96808137,
-5.03287253, -3.13212582, 18.81155666, -8.27988284,
3.85708447, 12.43039322, 17.98003878, 18.11009997,
-3.74294421, -16.62276121, 9.4446743 , 2.2060981 ,
8.34853736, 14.79144713, -1.91113975, -5.17061419,
4.53451746, 8.19090358, 7.98343201, 11.44592322,
-16.9132677 , -25.92554857, 10.10638432, -8.09236786,
20.8878207 , 19.52368296, 0.85858125, 2.61760415,
9.21360649, -8.1192651 , -6.94829273, 2.73562447,
13.40981323, -9.05018331, -17.77563166, -21.03927199,
4.10415845, -1.31550732, 5.68284828, 15.08670773,
-19.78675315, 12.94697869, -11.51797637, 1.91485992,
16.69417993, -16.04271622, -1.14028558, 9.79830109,
-18.58386093, -7.52963269, -10.10059878, -25.2194216 ,
-0.10598426, -15.77641532, -14.15999125, 14.35011271,
11.15178588, -14.43856266, 15.84015226, -3.41221883,
11.90724469, 0.57782081, 18.82127466, -6.01068727,
-19.83684476, 2.20091942, -1.38707755, -8.62821053,
-11.89000913, -11.69539815, 5.70242019, -3.83781841,
5.35894135, -0.30995954, 21.76661212, 8.52974329,
-9.13065082, -11.06209 , -12.00654618, 2.769838 ,
-12.21579496, -27.2686534 , -4.58538197, -6.94388425])
I'd like to plot normalized histogram of it, so in the plt.hist
options I choose density=True
:
import numpy as np
import matplotlib.pyplot as plt
data1, bins, _ = plt.hist(test_dataset, density=True);
print(np.trapz(data1))
print(sum(data1))
which outputs the following histogram:
0.18206124014272715
0.18866449755723017
From matplotlib documentation:
The density parameter, which normalizes bin heights so that the integral of the histogram is 1. The resulting histogram is an approximation of the probability density function.
But from my example it is clearly seen that the integral of the histogram is NOT 1 and strongly depends on the number of bins: if I specify it for example to be 40 the sum will increase:
data1, bins, _ = plt.hist(test_dataset, density=True);
print(np.trapz(data1))
print(sum(data1))
0.7508847002777762
0.7546579902289207
Is it incorrect description in documentation or I misunderstand some issues here?
CodePudding user response:
you do not calculate the area, area you should calculate as follow (in your example):
sum(data1 * np.diff(bins)) == 1