I have a series of numbers:
tmp <- c(round(seq(0, 12000, ((12000 - 0) / round(1500 * .05)))),
         round(seq(12000, 18900, ((18900 - 12000) / round(1500 * .1)))),
         round(seq(18900, 23300, ((23300 - 18900) / round(1500 * .1)))),
         round(seq(23300, 28100, ((28100 - 23300) / round(1500 * .1)))),
         round(seq(28100, 33500, ((33500 - 28100) / round(1500 * .1)))),
         round(seq(33500, 40000, ((40000 - 33500) / round(1500 * .1)))),
         round(seq(40000, 47700, ((47700 - 40000) / round(1500 * .1)))),
         round(seq(47700, 56500, ((56500 - 47700) / round(1500 * .1)))),
         round(seq(56500, 68300, ((68300 - 56500) / round(1500 * .1)))),
         round(seq(68300, 94200, ((94200 - 68300) / round(1500 * .1)))),
         round(seq(94200, 200000, ((200000 - 94200) / round(1500 * .05)))))
Now I can use geom_density to get the shape of the distribution. How do I get the approximate number of values in tmp that fall between two specific values, based on that density shape?
For example, I could count the values in tmp between 10050 and 10100 directly from the actual series. But I would like to count them based on the smoothed histogram (the density), which is not as linear as the actual series.
Answer:
I'm not sure I have understood the question correctly. The following code estimates the number of values of 'tmp' in a range from the density estimate instead of from the actual data. The estimate is a probability density, so you have to sum it over the range and multiply it:
by the spacing between the evaluation points (the width of each bin), to get the probability mass around each point
by the total number of values, to get an estimate of the number of values in the given range (in the example, 10000 to 20000, exclusive).
'density' is the function that 'geom_density' calls to compute the points it draws.
> k <- density(tmp); sum(k$y[which(k$x>10000 & k$x<20000)])*(k$x[2]-k$x[1])*length(tmp)
[1] 199.3722
> length(which(tmp>10000 & tmp<20000))
[1] 202
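The same idea can be wrapped in a small helper. The sketch below is not part of the original answer: the name count_from_density and its arguments are hypothetical, and it assumes the default settings of density(). It interpolates the density curve with approx() and integrates it with the trapezoid rule, so it also works for ranges narrower than the spacing of the density grid, such as 10050 to 10100.

```r
# Compact rebuild of the question's 'tmp' (same segments and step sizes).
breaks <- c(0, 12000, 18900, 23300, 28100, 33500, 40000,
            47700, 56500, 68300, 94200, 200000)
shares <- c(.05, rep(.1, 9), .05)
tmp <- unlist(Map(function(lo, hi, s) round(seq(lo, hi, (hi - lo) / round(1500 * s))),
                  head(breaks, -1), tail(breaks, -1), shares))

# Hypothetical helper: estimate how many values of x lie in [lower, upper]
# from a kernel density estimate rather than from the raw data.
count_from_density <- function(x, lower, upper, n_grid = 1001, ...) {
  k <- density(x, ...)                       # default bandwidth unless overridden via ...
  grid <- seq(lower, upper, length.out = n_grid)
  # Linear interpolation of the density curve; 0 outside the estimated range.
  dens <- approx(k$x, k$y, xout = grid, yleft = 0, yright = 0)$y
  # Trapezoid rule gives the probability mass between lower and upper.
  mass <- sum((head(dens, -1) + tail(dens, -1)) / 2 * diff(grid))
  mass * length(x)                           # scale the probability to a count
}

est_wide   <- count_from_density(tmp, 10000, 20000)
est_narrow <- count_from_density(tmp, 10050, 10100)
print(c(wide = est_wide, narrow = est_narrow))
```

For a wide range, the result should be close to the Riemann-sum estimate above. For a narrow range, the direct sum can select no grid points at all and return 0, whereas the interpolated version still returns a small positive estimate.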